Skip to content

fix(test): de-flake TestCleanDeletesOldestFiles#2446

Open
ValentaTomas wants to merge 1 commit intomainfrom
fix/flaky-clean-deletes-oldest-files
Open

fix(test): de-flake TestCleanDeletesOldestFiles#2446
ValentaTomas wants to merge 1 commit intomainfrom
fix/flaky-clean-deletes-oldest-files

Conversation

@ValentaTomas
Copy link
Copy Markdown
Member

@ValentaTomas ValentaTomas commented Apr 18, 2026

Summary

TestCleanDeletesOldestFiles in packages/orchestrator/cmd/clean-nfs-cache/cleaner is flaky. A recent example failure on PR #2394:

Expect \"subA/file5.txt\" to match \".+/file(6|7|8)\\.txt\"

The existing comment on the assertion already hints at the symptom ("Because of concurrency, sometimes we may pick an extra batch... so include 6 as well") — but the regex still leaves room for further drift (file5, file4, ...).

Root cause

With BatchN=4 / DeleteN=2, splitBatch reinserts the "younger" half of each batch back into its parent subdir. There's a real race here:

  1. Scanner pops the 4 candidates of batch 1, then pops a 5th (e.g. file6_A) and blocks on candidateCh while main is still processing batch 1.
  2. Main runs splitBatch and reinsertCandidates, putting file7_A back into subA (now the "oldest" entry there).
  3. file6_A (already popped) ends up in batch 2 and gets deleted.
  4. file7_A is now sitting in subA's in-memory Files but the scanner happens to pick subB next, so file7_A is never popped again — and never deleted from disk.

End state: per subA, file6 is deleted but file7 is not — a gap in what we expect to be an oldest-first deletion suffix.

This was reproducible locally with go test -count=2000 -race (a few failures per 16k runs).

Fix

  • Set BatchN == DeleteN so splitBatch never reinserts anything in this test, eliminating the cross-batch race. (splitBatch's reinsert logic is still covered separately by TestSplitBatch.)
  • Replace the brittle filename regex check with the actual algorithmic invariant: per subdir, the deleted indices form a contiguous suffix [N..8].

Diff is contained to a single test file (no production code changes).

Test plan

  • go test -race ./cmd/clean-nfs-cache/cleaner/ — pass
  • go test -count=2000 -race -run TestCleanDeletesOldestFiles ./cmd/clean-nfs-cache/cleaner/ — pass, repeated 8x (16,000 runs total) with no failures
  • Verified the previous test code reliably reproduces the flake under the same stress harness

The test failed intermittently in CI with messages like
"Expect subA/file5.txt to match file(6|7|8)\.txt".

Root cause is a race between the scanner and the main loop. With
BatchN=4 and DeleteN=2, the scanner can pop a 5th candidate (e.g.
file6_A) before the main loop reinserts the "younger" half of the
previous batch (e.g. file7_A) back into the same subdir. The
reinserted file7_A then never gets popped again, while file6_A goes
on to be deleted in the next batch — producing a "gap" in the
deleted suffix that the regex assertion didn't allow for.

Set BatchN == DeleteN so splitBatch reinserts nothing, eliminating
the cross-batch race. Replace the brittle filename regex with the
real algorithmic invariant the cleaner now guarantees: per subdir
the deleted indices form a contiguous suffix [N..8].

splitBatch's reinsert behavior is still covered by TestSplitBatch.
@cursor
Copy link
Copy Markdown

cursor bot commented Apr 18, 2026

PR Summary

Low Risk
Low risk: test-only changes that tighten assertions and adjust test parameters; no production logic is modified.

Overview
De-flakes TestCleanDeletesOldestFiles by setting BatchN to match DeleteN to avoid splitBatch/reinsert timing races, and replaces the filename-regex assertion with a per-subdir invariant check that deletions form a contiguous oldest-first suffix while still only asserting a minimum delete count (allowing extra deletes due to concurrency).

Reviewed by Cursor Bugbot for commit c21f81b. Bugbot is set up for automated code reviews on this repo. Configure here.

@ValentaTomas ValentaTomas assigned levb and unassigned dobrac Apr 18, 2026
@ValentaTomas ValentaTomas requested a review from levb April 18, 2026 22:08
@ValentaTomas ValentaTomas marked this pull request as ready for review April 18, 2026 22:15
@ValentaTomas ValentaTomas enabled auto-merge (squash) April 18, 2026 22:15
Copy link
Copy Markdown

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — clean test-only fix with a well-reasoned root cause and a stronger invariant assertion.

Extended reasoning...

Overview\nSingle test file change to de-flake TestCleanDeletesOldestFiles. Sets BatchN == DeleteN to prevent splitBatch from reinserting candidates (eliminating the documented race), and replaces the brittle filename regex with a per-subdir contiguous-suffix invariant check.\n\n### Security risks\nNone — test-only change with no production code modifications.\n\n### Level of scrutiny\nLow. The root cause is well-explained, the fix directly addresses it, and the new assertion is semantically stronger than what it replaces. TestSplitBatch continues to cover the reinsert logic independently.\n\n### Other factors\nNo bugs found. Thoroughly stress-tested (16,000 runs with -race). No outstanding reviewer comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants