Skip to content

Fix #16344: graceful HTTP/2 GOAWAY migration to eliminate request failure window#16345

Open
devjue wants to merge 6 commits into
apache:3.3from
devjue:fix/16344-graceful-goaway-migration
Open

Fix #16344: graceful HTTP/2 GOAWAY migration to eliminate request failure window#16345
devjue wants to merge 6 commits into
apache:3.3from
devjue:fix/16344-graceful-goaway-migration

Conversation

@devjue

@devjue devjue commented Jun 16, 2026

Copy link
Copy Markdown

What is the purpose of the change

Fixes #16344.

Eliminate the single-request failure window that occurs every time a Dubbo Triple Consumer receives an HTTP/2 GOAWAY(errorCode=0, lastStreamId=MAX_INT) frame from gateways / reverse proxies / providers that gracefully rotate connections (e.g. max_requests_per_connection, idle_timeout, hot-restart drain).

Previously, AbstractNettyConnectionClient#onGoaway nulled the channel reference immediately and NettyConnectionHandler#onGoAway then scheduled a reconnect, leaving a brief window in which AbstractClusterInvoker#checkInvokers throws RpcException before FailoverCluster's retry loop — so retries=N did NOT mitigate the issue. This PR implements graceful migration: keep the old channel serving until a new channel is ready, then atomically swap.

Brief changelog

  • NettyConnectionHandler#onGoAway: keep old channel alive; schedule attemptGracefulMigration on connectivityExecutor after GRACEFUL_RECONNECT_DELAY_MS = 200ms; one retry on failure, then fall back to AbstractNettyConnectionClient#onGoaway.
  • NettyConnectionClient#initBootstrap: closeFuture listener now calls compareAndClearNettyChannel(ch) instead of clearNettyChannel() — prevents the old channel's close listener from wiping a freshly swapped-in new channel.
  • AbstractNettyConnectionClient: add compareAndClearNettyChannel, plus package-private getConnectivityExecutor() / getReconnectDuration() accessors.
  • NettyChannel#getChannelIfPresent: lookup-only API used by channelInactive to avoid allocating a transient NettyChannel purely for logging.
  • TripleGoAwayHandler: log errorCode and lastStreamId so the GOAWAY trigger is observable.

Verifying this change

Validated against a local Triple demo with sustained 1 QPS traffic:

  • 26 GOAWAY frames received → 100% graceful migration success → zero No provider available errors and zero request failures.
  • Channel rotation chain (first 18 migrations):
    :45956 → :45110 → :53222 → :58366 → :56372 → :55436 → :55188 → :52926 → :50682 → :48682 → :47626 → :44014 → :43116 → :45044 → :48602 → :51880 → :50992 → :51672
  • No regression in existing dubbo-remoting-netty4 and dubbo-rpc-triple unit tests.

Does this pull request potentially affect one of the following parts

  • Dependencies (does it add or upgrade a dependency)
  • The public API
  • The runtime per-connection behavior (HTTP/2 GOAWAY handling on the Consumer side)
  • The persistence format of the configurations
  • The default values of configurations
  • The threading model (uses existing connectivityExecutor; does NOT introduce new threads)
  • The serialization protocol
  • The compatibility with previous versions

Backward-compatible: defaults are unchanged; the only observable difference is that GOAWAY frames no longer cause a request failure window.

Documentation

  • Does this pull request introduce a new feature? No (bug fix)
  • If yes, how is the feature documented? Not applicable

Happy to backport to 3.2 once this lands on 3.3.

…st failure window

Keep the old channel alive after a GOAWAY frame and bring up a new
connection asynchronously on the connectivity executor; atomically
swap the channel reference via a closeFuture CAS so the old channel's
close listener cannot wipe a freshly swapped-in new channel.

- NettyConnectionHandler: schedule attemptGracefulMigration on
  connectivityExecutor with GRACEFUL_RECONNECT_DELAY_MS=200ms; one
  retry on failure, then fall back to AbstractNettyConnectionClient#onGoaway.
- NettyConnectionClient#initBootstrap: closeFuture listener now
  compareAndClearNettyChannel(ch) instead of clearNettyChannel().
- AbstractNettyConnectionClient: add compareAndClearNettyChannel,
  plus package-private getConnectivityExecutor / getReconnectDuration.
- NettyChannel#getChannelIfPresent: lookup-only API used by
  channelInactive.
- TripleGoAwayHandler: log errorCode and lastStreamId.

Signed-off-by: Jue Zhang <devjue@gmail.com>
@devjue devjue force-pushed the fix/16344-graceful-goaway-migration branch from b111c6d to 02e4fa9 Compare June 16, 2026 06:36
@uuuyuqi

uuuyuqi commented Jun 16, 2026

Copy link
Copy Markdown

@chickenlj PTAL

@codecov-commenter

codecov-commenter commented Jun 16, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 66.66667% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.84%. Comparing base (ed0e11e) to head (f0170a9).

Files with missing lines Patch % Lines
...oting/transport/netty4/NettyConnectionHandler.java 59.25% 9 Missing and 2 partials ⚠️
.../dubbo/remoting/transport/netty4/NettyChannel.java 33.33% 1 Missing and 1 partial ⚠️
...ransport/netty4/AbstractNettyConnectionClient.java 80.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##                3.3   #16345      +/-   ##
============================================
- Coverage     60.86%   60.84%   -0.02%     
+ Complexity    11768    11767       -1     
============================================
  Files          1953     1953              
  Lines         89260    89291      +31     
  Branches      13471    13472       +1     
============================================
+ Hits          54329    54333       +4     
- Misses        29350    29367      +17     
- Partials       5581     5591      +10     
Flag Coverage Δ
integration-tests-java21 32.06% <40.47%> (-0.03%) ⬇️
integration-tests-java8 32.18% <40.47%> (-0.03%) ⬇️
samples-tests-java21 32.15% <14.28%> (+<0.01%) ⬆️
samples-tests-java8 29.71% <11.90%> (-0.13%) ⬇️
unit-tests-java11 59.07% <26.19%> (-0.02%) ⬇️
unit-tests-java17 58.59% <61.90%> (+0.01%) ⬆️
unit-tests-java21 58.55% <26.19%> (-0.01%) ⬇️
unit-tests-java25 58.52% <61.90%> (-0.02%) ⬇️
unit-tests-java8 59.07% <26.19%> (-0.08%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@zrlw

zrlw commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

We would appreciate if you add testcases to https://github.com/apache/dubbo-samples for this PR.

@LI123456mo LI123456mo left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zrlw , have been going through the PR , seen the changes made , tried to run the changes locally,test the changes, and it worked,

image
@Test
void testGoawayCausesNoChannelFailure() throws Throwable {
    final AbstractConnectionClient connectionClient =
            connectionManager.connect(url, new DefaultPuHandler());
    Assertions.assertTrue(
            connectionClient.isAvailable(),
            "Precondition: connection should be available before GOAWAY"
    );
    io.netty.channel.Channel channelBeforeGoaway = connectionClient.getChannel(true);
    NettyConnectionHandler handler = new NettyConnectionHandler(
            (AbstractNettyConnectionClient) connectionClient);
    handler.onGoAway(channelBeforeGoaway);
    io.netty.channel.Channel channelAfterGoaway = connectionClient.getChannel(true);
    Assertions.assertNotNull(
            channelAfterGoaway,
            "Channel should NOT be null immediately after GOAWAY " +
                    "— old channel should stay alive until new one is ready"
    );
    await().atMost(Duration.ofSeconds(10))
            .until(() -> connectionClient.isAvailable());
    Assertions.assertTrue(
            connectionClient.isAvailable(),
            "Connection should be available after GOAWAY migration completes"
    );
    connectionClient.close();
}

}

LI123456mo

This comment was marked as duplicate.

@zrlw

zrlw commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

@test
void testGoawayCausesNoChannelFailure() throws Throwable {
final AbstractConnectionClient connectionClient =
connectionManager.connect(url, new DefaultPuHandler());
Assertions.assertTrue(
connectionClient.isAvailable(),
"Precondition: connection should be available before GOAWAY"
);
io.netty.channel.Channel channelBeforeGoaway = connectionClient.getChannel(true);
NettyConnectionHandler handler = new NettyConnectionHandler(
(AbstractNettyConnectionClient) connectionClient);
handler.onGoAway(channelBeforeGoaway);
io.netty.channel.Channel channelAfterGoaway = connectionClient.getChannel(true);
Assertions.assertNotNull(
channelAfterGoaway,
"Channel should NOT be null immediately after GOAWAY " +
"— old channel should stay alive until new one is ready"
);
await().atMost(Duration.ofSeconds(10))
.until(() -> connectionClient.isAvailable());
Assertions.assertTrue(
connectionClient.isAvailable(),
"Connection should be available after GOAWAY migration completes"
);
connectionClient.close();
}

Created TripleGoAwayTest.java based on these codes.

@LI123456mo LI123456mo left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zrlw zrlw left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zrlw zrlw added type/bug Bugs to being fixed and removed bug labels Jun 25, 2026
@EarthChen EarthChen requested a review from Copilot June 25, 2026 05:06

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR changes Dubbo Triple (HTTP/2) Consumer-side GOAWAY handling to avoid a transient “no channel/provider available” window by keeping the old connection usable while establishing a replacement connection, then swapping the active channel safely.

Changes:

  • Implement delayed, connectivity-executor-based graceful reconnection on GOAWAY (instead of immediately nulling the channel reference).
  • Prevent old channel closeFuture listeners from clearing a newly swapped-in channel via CAS-based compareAndClearNettyChannel.
  • Add diagnostics (GOAWAY errorCode/lastStreamId logging) and a new unit test for the “channel should not become null immediately after GOAWAY” behavior.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
dubbo-rpc/dubbo-rpc-triple/src/main/java/org/apache/dubbo/rpc/protocol/tri/transport/TripleGoAwayHandler.java Adds GOAWAY frame diagnostics and triggers connection migration on GOAWAY.
dubbo-remoting/dubbo-remoting-netty4/src/test/java/org/apache/dubbo/remoting/transport/netty4/TripleGoAwayTest.java New test to validate no immediate channel-null window after GOAWAY.
dubbo-remoting/dubbo-remoting-netty4/src/main/java/org/apache/dubbo/remoting/transport/netty4/NettyConnectionHandler.java Implements delayed graceful migration scheduling and avoids reconnect-on-inactive for GOAWAY-marked channels.
dubbo-remoting/dubbo-remoting-netty4/src/main/java/org/apache/dubbo/remoting/transport/netty4/NettyConnectionClient.java Updates close listener to CAS-clear only the expected channel reference.
dubbo-remoting/dubbo-remoting-netty4/src/main/java/org/apache/dubbo/remoting/transport/netty4/NettyChannel.java Adds cache lookup-only helper to avoid transient NettyChannel allocation on inactive channels.
dubbo-remoting/dubbo-remoting-netty4/src/main/java/org/apache/dubbo/remoting/transport/netty4/AbstractNettyConnectionClient.java Adds CAS clear helper/accessors; tweaks logs; keeps old->new swap safe under concurrent close.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 44 to 46
if (msg instanceof Http2GoAwayFrame) {
Http2GoAwayFrame goAwayFrame = (Http2GoAwayFrame) msg;
final ConnectionHandler connectionHandler =
Comment on lines +28 to +30
import org.apache.dubbo.rpc.model.ApplicationModel;
import org.apache.dubbo.rpc.model.ModuleModel;

Comment on lines +43 to +47
private static URL url;

private static NettyPortUnificationServer server;

private static ConnectionManager connectionManager;
Comment on lines +51 to +54
int port = NetUtils.getAvailablePort();
url = URL.valueOf("tri://127.0.0.1:" + port + "?foo=bar");
ApplicationModel applicationModel = ApplicationModel.defaultModel();
ApplicationConfig applicationConfig = new ApplicationConfig("provider-app");
Comment on lines +72 to +78
public static void close() {
try {
server.close();
} catch (Throwable e) {
// ignored
}
}
Comment on lines 232 to 234
if (logger.isDebugEnabled()) {
logger.debug("Connection:{} connected", this);
logger.debug("Connection:{} connected, graceful migration complete", this);
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Single-request failure window when receiving HTTP/2 GOAWAY due to immediate channel nullification

6 participants