fix(test): de-flake TestCleanDeletesOldestFiles#2446
Open
ValentaTomas wants to merge 1 commit intomainfrom
Open
fix(test): de-flake TestCleanDeletesOldestFiles#2446ValentaTomas wants to merge 1 commit intomainfrom
ValentaTomas wants to merge 1 commit intomainfrom
Conversation
The test failed intermittently in CI with messages like "Expect subA/file5.txt to match file(6|7|8)\.txt". Root cause is a race between the scanner and the main loop. With BatchN=4 and DeleteN=2, the scanner can pop a 5th candidate (e.g. file6_A) before the main loop reinserts the "younger" half of the previous batch (e.g. file7_A) back into the same subdir. The reinserted file7_A then never gets popped again, while file6_A goes on to be deleted in the next batch — producing a "gap" in the deleted suffix that the regex assertion didn't allow for. Set BatchN == DeleteN so splitBatch reinserts nothing, eliminating the cross-batch race. Replace the brittle filename regex with the real algorithmic invariant the cleaner now guarantees: per subdir the deleted indices form a contiguous suffix [N..8]. splitBatch's reinsert behavior is still covered by TestSplitBatch.
PR SummaryLow Risk Overview Reviewed by Cursor Bugbot for commit c21f81b. Bugbot is set up for automated code reviews on this repo. Configure here. |
There was a problem hiding this comment.
LGTM — clean test-only fix with a well-reasoned root cause and a stronger invariant assertion.
Extended reasoning...
Overview\nSingle test file change to de-flake TestCleanDeletesOldestFiles. Sets BatchN == DeleteN to prevent splitBatch from reinserting candidates (eliminating the documented race), and replaces the brittle filename regex with a per-subdir contiguous-suffix invariant check.\n\n### Security risks\nNone — test-only change with no production code modifications.\n\n### Level of scrutiny\nLow. The root cause is well-explained, the fix directly addresses it, and the new assertion is semantically stronger than what it replaces. TestSplitBatch continues to cover the reinsert logic independently.\n\n### Other factors\nNo bugs found. Thoroughly stress-tested (16,000 runs with -race). No outstanding reviewer comments.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
TestCleanDeletesOldestFilesinpackages/orchestrator/cmd/clean-nfs-cache/cleaneris flaky. A recent example failure on PR #2394:The existing comment on the assertion already hints at the symptom ("Because of concurrency, sometimes we may pick an extra batch... so include 6 as well") — but the regex still leaves room for further drift (file5, file4, ...).
Root cause
With
BatchN=4/DeleteN=2,splitBatchreinserts the "younger" half of each batch back into its parent subdir. There's a real race here:file6_A) and blocks oncandidateChwhile main is still processing batch 1.splitBatchandreinsertCandidates, puttingfile7_Aback intosubA(now the "oldest" entry there).file6_A(already popped) ends up in batch 2 and gets deleted.file7_Ais now sitting insubA's in-memoryFilesbut the scanner happens to picksubBnext, sofile7_Ais never popped again — and never deleted from disk.End state: per
subA,file6is deleted butfile7is not — a gap in what we expect to be an oldest-first deletion suffix.This was reproducible locally with
go test -count=2000 -race(a few failures per 16k runs).Fix
BatchN == DeleteNsosplitBatchnever reinserts anything in this test, eliminating the cross-batch race. (splitBatch's reinsert logic is still covered separately byTestSplitBatch.)[N..8].Diff is contained to a single test file (no production code changes).
Test plan
go test -race ./cmd/clean-nfs-cache/cleaner/— passgo test -count=2000 -race -run TestCleanDeletesOldestFiles ./cmd/clean-nfs-cache/cleaner/— pass, repeated 8x (16,000 runs total) with no failures