Skip to content

HDDS-14793. Fix race condition in XceiverClientGrpc#connectToDatanode causing intermittent NPEs.#9997

Draft
ptlrs wants to merge 5 commits intoapache:masterfrom
ptlrs:HDDS-14793-Intermittent-NPE-in-XceiverClientGrpc
Draft

HDDS-14793. Fix race condition in XceiverClientGrpc#connectToDatanode causing intermittent NPEs.#9997
ptlrs wants to merge 5 commits intoapache:masterfrom
ptlrs:HDDS-14793-Intermittent-NPE-in-XceiverClientGrpc

Conversation

@ptlrs
Copy link
Copy Markdown
Contributor

@ptlrs ptlrs commented Mar 29, 2026

What changes were proposed in this pull request?

The XceiverClientGrpc#connectToDatanode intermittently fails with an NPE.
The problem is that for a given datanode, there is a race condition between creating a channel and creating a stub.

When a new channel is created for a DN, it is put into the channels map. However, presence of a channel in the map does not imply that the corresponding stub for the same DN also exists in the asyncStubs map.

If the stub is accessed after creating a channel but before the creation of stub, we can get an NPE.

This PR fixes the problem by:

  • maintaining only one dnChannelInfoMap for both the channels and stubs instead of two independent maps
  • creating a ChannelInfo class to group the channel and stub

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14793

How was this patch tested?

CI: https://github.com/ptlrs/ozone/actions/runs/23703558972
Flaky test runner: https://github.com/ptlrs/ozone/actions/runs/23823575690

Copy link
Copy Markdown
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ptlrs for working on this.

Comment on lines +177 to +183
/**
* Checks if the client has a live connection channel to the specified Datanode.
*
* @return True if the connection is alive, false otherwise.
*/
@VisibleForTesting
public boolean isConnected(DatanodeDetails details) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please don't move isConnected unnecessarily, it inflates the diff

Comment on lines +162 to +175
private void connectToDatanode(DatanodeDetails dn) throws IOException {
if (isClosed.get()) {
throw new IOException("Client is closed.");
}

if (isConnected(dn)) {
return;
}
// read port from the data node, on failure use default configured port
int port = dn.getStandalonePort().getValue();
if (port == 0) {
port = config.getInt(OzoneConfigKeys.HDDS_CONTAINER_IPC_PORT,
OzoneConfigKeys.HDDS_CONTAINER_IPC_PORT_DEFAULT);
}
final int finalPort = port;

LOG.debug("Connecting to server : {}; nodes in pipeline : {}, ", dn, pipeline.getNodes());
LOG.debug("Connecting to server: {}; nodes in pipeline: {}", dn, pipeline.getNodes());

channels.computeIfPresent(dn.getID(), (dnId, channel) -> {
if (channel.isTerminated() || channel.isShutdown()) {
asyncStubs.remove(dnId);
return null; // removes from channels map
}
removeStaleChannel(dn);
generateNewChannel(dn);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

connectToDatanode still performs multiple map operations in a non-atomic way:

  • calls isConnected twice, once directly, once via removeStaleChannel
  • isConnected calls containsKey and get separately
  • removeStaleChannel calls isConnected and remove separately
  • generateNewChannel calls unconditional put

Use only one compute operation, and distinguish between the three cases in that call:

  • absent
  • present but stale
  • present and active

Comment on lines +780 to +781
private ManagedChannel channel;
private XceiverClientProtocolServiceStub stub;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: these can be final

@ptlrs
Copy link
Copy Markdown
Contributor Author

ptlrs commented Mar 29, 2026

Thanks for the review @adoroszlai, I have pushed the fixes.

@ptlrs ptlrs requested a review from adoroszlai March 30, 2026 16:28
return null; // removes from channels map
dnChannelInfoMap.compute(dn.getID(), (dnId, channelInfo) -> {
// channel is absent or stale
if (channelInfo == null || channelInfo.isChannelInactive()) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to close inactive channel?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the channel is inactive, it is already either terminated or shutdown. In both cases the underlying resources have already been released.

Copy link
Copy Markdown
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ptlrs for updating the patch. compute logic looks good now.

Comment on lines +239 to +243
if (details == null || !dnChannelInfoMap.containsKey(details.getID())) {
return false;
}

private boolean isConnected(ManagedChannel channel) {
return channel != null && !channel.isTerminated() && !channel.isShutdown();
return !dnChannelInfoMap.get(details.getID()).isChannelInactive();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid separate containsKey and get.

if (details == null) {
  return false;
}
ChannelInfo info = dnChannelInfoMap.get(details.getID());
return info != null && !info.isChannelInactive();

Comment on lines +266 to +267
.map(ChannelInfo::getChannel)
.collect(Collectors.toList());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ChannelInfo.channel may be null (or at least there are checks for that elsewhere).

Suggested change
.map(ChannelInfo::getChannel)
.collect(Collectors.toList());
.map(ChannelInfo::getChannel)
.filter(Objects::nonNull)
.collect(Collectors.toList());

Comment on lines 735 to 754
throw new IOException("This channel is not connected.");
}

ManagedChannel channel = channels.get(dn.getID());
// If the channel doesn't exist for this specific datanode or the channel
// is closed, just reconnect
if (!isConnected(channel)) {
// If the channel doesn't exist for this specific datanode or the channel is closed, just reconnect
if (!isConnected(dn)) {
reconnect(dn);
}

}

private void reconnect(DatanodeDetails dn)
throws IOException {
ManagedChannel channel;
try {
connectToDatanode(dn);
channel = channels.get(dn.getID());
} catch (Exception e) {
throw new IOException("Error while connecting", e);
}

if (!isConnected(channel)) {
if (!isConnected(dn)) {
throw new IOException("This channel is not connected.");
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that connectToDatanode handles all cases ("closed", "already connected", "needs new connection"), checkOpen can be simplified:

    connectToDatanode(dn);

    if (!isConnected(dn)) {
      throw new IOException("This channel is not connected.");
    }

and reconnect can be removed.

Comment on lines -201 to -203
} catch (RuntimeException e) {
LOG.error("Failed to create channel to datanode {}", dn, e);
throw new IOException(e.getCause());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to keep the translation back from unchecked exception to IOException, because callers may not handle the former.

try {
  dnChannelInfoMap.compute(...);
} catch (UncheckedIOException e) {
  LOG.error("Failed to create channel to datanode {}", dn, e);
  throw e.getCause();
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants