HBASE-29364 Fix FAILED_OPEN state during master restart #8427
SwaraliJoshi wants to merge 4 commits into
 */
@Tag(MasterTests.TAG)
@Tag(MediumTests.TAG)
public class TestOpenRegionProcedure extends TestAssignmentManagerBase {
Is it possible to reproduce the problem in a real mini cluster with a real TRSP?
We need to confirm that this could happen in real production.
Yes, I had tests that confirmed this issue before fixing it. Here is the code:
- Mini cluster test: TestFailedOpenRestoredAsOpenOnFailover.java
- Manual reproduction of the exact bug: TestOpenRegionProcedureRestoreFailedOpen.java
These two assert the pre-fix (buggy) behavior — that a FAILED_OPEN ends up OPEN. With the fix currently in place, they will fail. They're bug-demonstration artifacts, not regression guards.
Let's include them in this PR too.
But these tests will fail. Or should I instead use assertEquals(State.OPENING, regionState(newMaster, hri));?
Update: I have added the test, which now asserts that the state is marked as OPENING.
// The region must NOT be OPEN at this point - the open failed.
State beforeFailover = regionState(cluster.getMaster(), hri);
LOG.info("State before failover (active master): {}", beforeFailover);
Instead of logging it, we can assert that the region state matches the expected value.
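A minimal sketch of the idea, kept independent of HBase so it runs on its own. The `State` enum and the `assertState` helper are hypothetical stand-ins, and treating FAILED_OPEN as the expected pre-failover value is an assumption taken from the PR title; the real test would use the project's own assertions and the value returned by `regionState(cluster.getMaster(), hri)`.

```java
import java.util.Objects;

public class RegionStateAssertion {
    // Hypothetical stand-in for the HBase region state enum, reduced to the
    // three values mentioned in this thread.
    enum State { OPENING, OPEN, FAILED_OPEN }

    // Throws if the observed state differs from the expected one, so a wrong
    // state fails the test instead of only showing up in the log.
    static void assertState(State expected, State actual, String when) {
        if (!Objects.equals(expected, actual)) {
            throw new AssertionError(
                "Region state " + when + ": expected " + expected + " but was " + actual);
        }
    }

    public static void main(String[] args) {
        // In the real test this value would come from
        // regionState(cluster.getMaster(), hri); it is hard-coded here.
        State beforeFailover = State.FAILED_OPEN;
        // Assumption: FAILED_OPEN is the expected pre-failover state.
        assertState(State.FAILED_OPEN, beforeFailover, "before failover");
        System.out.println("state check passed");
    }
}
```

Putting the phase ("before failover") into the failure message keeps the context the log line used to give.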
@virajjasani Updated sir, could you please check now?
HMaster newMaster = cluster.getMaster();
LOG.info("New active master: {}", newMaster.getServerName());

UTIL.waitFor(60_000, 500, () -> regionState(newMaster, hri) == State.OPENING);
I think the real harm here is that the region is recorded as OPEN on the target region server when it is actually not online there. So the end-to-end test should make sure that the region finally comes online and that we can read from and write to it. Alternatively, we can get the region location from meta and then check directly with that region server whether the region is on it.
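The second option can be sketched as a plain cross-check between what meta records and what each region server actually hosts. This is only an illustration with hypothetical names (`isActuallyOnline`, the two maps, the sample server and region names); in a mini cluster test the two inputs would come from meta and from the region servers themselves rather than from in-memory maps.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RegionPlacementCheck {
    // Returns true only if meta names a server for the region AND that server
    // really has the region online. Both maps are hypothetical stand-ins.
    static boolean isActuallyOnline(String region,
                                    Map<String, String> metaLocations,
                                    Map<String, Set<String>> onlineRegionsByServer) {
        String server = metaLocations.get(region);
        if (server == null) {
            return false; // meta has no location for this region
        }
        Set<String> online =
            onlineRegionsByServer.getOrDefault(server, Collections.<String>emptySet());
        return online.contains(region);
    }

    public static void main(String[] args) {
        // The inconsistency discussed above: meta points at rs1,
        // but rs1 is not hosting the region.
        Map<String, String> meta = new HashMap<>();
        meta.put("region-a", "rs1");
        Map<String, Set<String>> servers = new HashMap<>();
        servers.put("rs1", new HashSet<String>());
        System.out.println(isActuallyOnline("region-a", meta, servers));

        // Once rs1 really hosts the region, the check passes.
        servers.get("rs1").add("region-a");
        System.out.println(isActuallyOnline("region-a", meta, servers));
    }
}
```

A check of this shape catches the case where the master's view and the region server's view disagree, which a state-only assertion on the master would miss.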
https://issues.apache.org/jira/browse/HBASE-29364