Skip to content

HBASE-29364 Fix FAILED_OPEN state during master restart#8427

Open
SwaraliJoshi wants to merge 4 commits into
apache:masterfrom
SwaraliJoshi:HBASE-29364
Open

HBASE-29364 Fix FAILED_OPEN state during master restart#8427
SwaraliJoshi wants to merge 4 commits into
apache:masterfrom
SwaraliJoshi:HBASE-29364

Conversation

@SwaraliJoshi

Copy link
Copy Markdown
Contributor

@SwaraliJoshi SwaraliJoshi changed the title Hbase 29364 Fix FAILED_OPEN state during master restart HBASE-29364 Fix FAILED_OPEN state during master restart Jun 28, 2026
*/
@Tag(MasterTests.TAG)
@Tag(MediumTests.TAG)
public class TestOpenRegionProcedure extends TestAssignmentManagerBase {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to reproduce the problem in a real mini cluster with a real TRSP?
We need to confirm that this could happen in real production.

@SwaraliJoshi SwaraliJoshi Jun 29, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I did have a test to confirm this issue before fixing it. Here is the code :

  1. Mini Cluster test -
    TestFailedOpenRestoredAsOpenOnFailover.java

  2. Replication of exact bug manually -
    TestOpenRegionProcedureRestoreFailedOpen.java

These two assert the pre-fix (buggy) behavior — that a FAILED_OPEN ends up OPEN. With the fix currently in place, they will fail. They're bug-demonstration artifacts, not regression guards.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's include them in this PR too.

@SwaraliJoshi SwaraliJoshi Jun 29, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But these tests will fail. Or should I instead have assertEquals(State.OPENING, regionState(newMaster, hri)); ?

Update : I have added the test to assert the state is marked as OPENING

@SwaraliJoshi SwaraliJoshi requested a review from Apache9 June 29, 2026 06:47
@virajjasani

Copy link
Copy Markdown
Contributor

Build failure:

Banned imports detected:
Reason: Use junit5 instead
	in file: org/apache/hadoop/hbase/master/assignment/TestOpenRegionProcedureRestoreFailedOpen.java
		static org.junit.Assert.assertNotEquals (Line: 20, Matched by: static org.junit.Assert.**)


// The region must NOT be OPEN at this point - the open failed.
State beforeFailover = regionState(cluster.getMaster(), hri);
LOG.info("State before failover (active master): {}", beforeFailover);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of this log, we can assert the region state with expected value

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@virajjasani Updated sir, could you please check now?

@SwaraliJoshi SwaraliJoshi requested a review from virajjasani June 30, 2026 07:44
HMaster newMaster = cluster.getMaster();
LOG.info("New active master: {}", newMaster.getServerName());

UTIL.waitFor(60_000, 500, () -> regionState(newMaster, hri) == State.OPENING);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the real harm here is that, the region is recorded as OPEN on the target regionserver but actually it is not? So the end to end test should make sure that the region is finally online and we can read/write it? Or we get the region location in meta, and then check with the region server directly about whether the region is on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants