MOD-14618 Fix race condition when calling indexLabelCount on SVS without locks #919

Merged: meiravgri merged 4 commits into main from joan-fix-flaky-test, Mar 25, 2026
Conversation


@JoanFM (Contributor) commented Mar 19, 2026

Fixes a race condition in VecSimTieredIndex::debugInfoIterator() that can crash when called while background indexing is running.

Root Cause

debugInfoIterator() accesses both sub-indexes without holding any locks. For SVS, debugInfoIterator() internally calls indexLabelCount(), which reads internal translation tables. These tables can be concurrently modified by background threads, causing a crash.

VecSimTieredIndex::debugInfoIterator()
  → backendIndex->debugInfoIterator()      ← NO LOCK HELD
    → SVSIndex::debugInfoIterator()
      → SVSIndex::debugInfo()
        → SVSIndex::indexLabelCount()      ← reads translation tables
          → CRASH: tables appear inconsistent due to concurrent modification

Fix

Hold the appropriate shared lock when accessing each sub-index's debug info:

  • Hold flatIndexGuard when calling frontendIndex->debugInfoIterator()
  • Hold mainIndexGuard when calling backendIndex->debugInfoIterator()

Each lock is released as soon as the respective iterator is created, minimizing contention.
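The locking pattern can be sketched as follows. This is a minimal illustration with hypothetical stand-in types, not the actual VecSimTieredIndex code; the real member names, iterator types, and signatures differ:

```cpp
#include <memory>
#include <shared_mutex>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the real sub-index and debug-info iterator types.
struct DebugInfoIterator {
    std::vector<int> fields;
};

struct SubIndex {
    std::vector<int> data;
    // Safe only while the caller holds the corresponding guard (shared).
    std::unique_ptr<DebugInfoIterator> debugInfoIterator() const {
        return std::make_unique<DebugInfoIterator>(DebugInfoIterator{data});
    }
};

struct TieredIndex {
    SubIndex frontendIndex, backendIndex;
    mutable std::shared_mutex flatIndexGuard, mainIndexGuard;

    std::pair<std::unique_ptr<DebugInfoIterator>, std::unique_ptr<DebugInfoIterator>>
    debugInfoIterators() const {
        std::unique_ptr<DebugInfoIterator> front, back;
        {
            // Hold flatIndexGuard (shared) only while the frontend iterator is built.
            std::shared_lock<std::shared_mutex> lock(flatIndexGuard);
            front = frontendIndex.debugInfoIterator();
        } // flatIndexGuard released here, before touching the backend.
        {
            // Hold mainIndexGuard (shared) only while the backend iterator is built.
            std::shared_lock<std::shared_mutex> lock(mainIndexGuard);
            back = backendIndex.debugInfoIterator();
        }
        return {std::move(front), std::move(back)};
    }
};
```

Scoping each lock to its own block mirrors the PR's intent: background writers are blocked only for the brief window in which each iterator snapshot is created.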

Follow-up: Once Intel releases a new version of SVS that includes intel/ScalableVectorSearch#301, we need a follow-up PR to upgrade SVS and compile with SVS_EXPERIMENTAL_CHECK_BOUNDS=OFF.
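For illustration only, the follow-up build step might look like the following; it assumes the option keeps the exact name SVS_EXPERIMENTAL_CHECK_BOUNDS and is exposed as a CMake cache variable once the SVS release lands:

```shell
# Hypothetical follow-up build, once an SVS release includes
# intel/ScalableVectorSearch#301; flag name taken from that PR.
cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug -DSVS_EXPERIMENTAL_CHECK_BOUNDS=OFF
cmake --build build
```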

Mark if applicable

  • This PR introduces API changes
  • This PR introduces serialization changes


Note

Low Risk
Low risk concurrency fix that only adds shared-lock guards around frontendIndex/backendIndex debug-info access; main concern is potential for minor lock contention in debug/info paths.

Overview
Prevents a race when retrieving tiered index debug info during background indexing by acquiring flatIndexGuard and mainIndexGuard (shared) while calling frontendIndex->debugInfoIterator() and backendIndex->debugInfoIterator().

Locks are scoped independently per sub-index to keep contention minimal while ensuring thread-safe access to backend SVS debug paths that may touch mutable translation tables.

Written by Cursor Bugbot for commit 56d0b33.

@JoanFM changed the title from "proposal to fix flaky test" to "Fix race condition in VecSimTieredIndex::debugInfoIterator during background indexing" Mar 19, 2026

jit-ci bot commented Mar 19, 2026

🛡️ Jit Security Scan Results


✅ No security findings were detected in this PR


Security scan by Jit

@JoanFM JoanFM requested review from GuyAv46 and meiravgri and removed request for GuyAv46 March 19, 2026 16:33

codecov bot commented Mar 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.98%. Comparing base (3dd2dc2) to head (56d0b33).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #919   +/-   ##
=======================================
  Coverage   96.98%   96.98%           
=======================================
  Files         129      129           
  Lines        7567     7574    +7     
=======================================
+ Hits         7339     7346    +7     
  Misses        228      228           


@meiravgri changed the title from "Fix race condition in VecSimTieredIndex::debugInfoIterator during background indexing" to "Fix race condition when calling indexLabelCount on SVS without locks" Mar 23, 2026
@meiravgri changed the title from "Fix race condition when calling indexLabelCount on SVS without locks" to "MOD-14618 Fix race condition when calling indexLabelCount on SVS without locks" Mar 23, 2026
@meiravgri meiravgri requested a review from alonre24 March 23, 2026 10:47
@alonre24 (Collaborator) commented:

  1. Regarding protecting indexLabelCount when it is called from VecSimTieredIndex::debugInfoIterator: we should add the shared-lock protection; we should not worry about performance here.
  2. Regarding the heuristic: what is the suggested solution, if not acquiring the shared lock? Reading the index size instead and making it atomic?

@meiravgri @GuyAv46

@meiravgri (Collaborator) commented:

@alonre24 Reading the index size instead of the label count changes the heuristic, as it includes marked-deleted vectors.

  1. We can wrap only the label count call; not sure it makes much of a difference.
  2. The exception is thrown only when a specific compilation option is set:
    https://github.com/intel/ScalableVectorSearch/blob/aa272e270d198106a6b595632db3ad517ebbc773/include/svs/lib/boundscheck.h#L28
    This option is enabled when the build type is not Release. We can ask Intel to introduce an option to disable it from an external source; it only requires changes in the public repo. I already confirmed with them that the binaries are built with the option disabled.

@GuyAv46 (Collaborator) commented Mar 24, 2026

We can take this fix.
IMO, both the size and label count APIs should become safe. We can track both with an atomic variable that we couple to the map size every time we change it.

@alonre24 (Collaborator) commented:

@meiravgri - we can disable SVS_CHECK_BOUNDS when building from SVS source, but I'm worried about the unsafe access itself, not just the exception. So I guess we should protect both index size and label count, either by locking or by using atomic vars as @GuyAv46 mentioned (whichever is simpler to implement) - I don't think it should bring any regression.

@GuyAv46 (Collaborator) commented Mar 24, 2026

Don’t we already hold the locks when calling this API (through the batch iterator)?

@meiravgri (Collaborator) commented:

@alonre24 The label count is an internal value in SVS; we would need to ask them to change it if we want to take the atomic approach. For HNSW and flat we can change the counts to be atomic.
@GuyAv46 VecSimIndex_PreferAdHocSearch is an independent API and is called by RediSearch. RediSearch only acquires the locks for the ad-hoc BF iterator, which doesn't involve calling VecSimIndex_PreferAdHocSearch.

@GuyAv46 (Collaborator) commented Mar 24, 2026

@meiravgri we do call VecSimIndex_PreferAdHocSearch while we have a batch iterator, and the HNSW tiered batch iterator holds the main index guard throughout the execution, and releases it in its destructor (or if it's depleted), so the current solution is unsafe

@meiravgri (Collaborator) commented:

@GuyAv46 Right, I missed that, thanks!
So for that call we are already OK; we just need to convert the flat index size variable to atomic.
For the unguarded read we can introduce VecSimIndex_PreferAdHocSearch_Safe, which would lock. It is called once when creating the iterator, so it should be OK.
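As a rough sketch of the proposed wrapper (hypothetical types; the real VecSimIndex C API and guard layout differ, and this proposal was ultimately dropped in favor of accepting the benign race):

```cpp
#include <cstddef>
#include <shared_mutex>

// Hypothetical tiered index exposing the two guards and the raw heuristic.
struct TieredIndex {
    mutable std::shared_mutex flatIndexGuard, mainIndexGuard;
    std::size_t flatSize = 0, mainSize = 0;
    bool preferAdHocSearch(std::size_t subsetSize) const {
        // Stand-in for the real heuristic, which reads sizes/label counts.
        return subsetSize < flatSize + mainSize;
    }
};

// Proposed safe variant: take both guards (shared) around the raw call.
// It would run once at batch-iterator creation, so the cost is negligible.
bool VecSimIndex_PreferAdHocSearch_Safe(const TieredIndex &idx,
                                        std::size_t subsetSize) {
    std::shared_lock<std::shared_mutex> flatLock(idx.flatIndexGuard);
    std::shared_lock<std::shared_mutex> mainLock(idx.mainIndexGuard);
    return idx.preferAdHocSearch(subsetSize);
}
```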

@meiravgri (Collaborator) commented:

@alonre24 @GuyAv46

There are two call sites for preferAdHocSearch from RediSearch:

  1. Initial heuristic check: no locks held.
  2. Re-check during batch iteration: mainIndexGuard is already held shared by the batch iterator, but flatIndexGuard is not held.

Several options were considered for fixing the race in preferAdHocSearch:

Option A: Lock at the RediSearch call sites. Use the existing AcquireSharedLocks/ReleaseSharedLocks API to protect call site 1 (which has no locks), and accept the racy BF read in call site 2 (which already holds mainIndexGuard but not flatIndexGuard). Problem: this only solves half the issue — call site 2 still reads BF's indexSize/indexLabelCount without the flat lock.

Option B: Atomic shadow counters. Maintain std::atomic<size_t> for label count and index size in each index type, updated on every add/delete. preferAdHocSearch reads atomics instead of the actual data structures. Problem: this is invasive — for multi-value indexes (BF and HNSW), indexLabelCount() returns a map size (labelToIdsLookup.size() / labelLookup.size()). Shadowing this with an atomic counter means manually tracking when a label is truly added (first vector for that label) vs. when it's truly removed (last vector of that label), duplicating logic that already lives in the map operations. This is error-prone and adds a maintenance burden. For SVS, the label count lives inside Intel's library, so we'd need a separate shadow counter there too.
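To make Option B's cost concrete, here is a sketch (hypothetical names, not the real BF/HNSW code) of what a shadow counter would force a multi-value index to do; the first-vector/last-vector bookkeeping duplicates what the map already tracks:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical multi-value index: one label may own several vector ids.
struct MultiValueIndex {
    std::unordered_map<int, std::vector<int>> labelToIdsLookup;
    std::atomic<std::size_t> labelCount{0};  // shadow of labelToIdsLookup.size()

    void addVector(int label, int id) {
        auto &ids = labelToIdsLookup[label];
        if (ids.empty())              // only the FIRST vector of a label
            labelCount.fetch_add(1);  // actually grows the label count
        ids.push_back(id);
    }

    void removeVector(int label, int id) {
        auto it = labelToIdsLookup.find(label);
        if (it == labelToIdsLookup.end())
            return;
        auto &ids = it->second;
        ids.erase(std::remove(ids.begin(), ids.end(), id), ids.end());
        if (ids.empty()) {            // only the LAST vector of a label
            labelToIdsLookup.erase(it);
            labelCount.fetch_sub(1);  // actually shrinks the label count
        }
    }

    // Lock-free read; may briefly lag the map during an in-flight update.
    std::size_t indexLabelCount() const { return labelCount.load(); }
};
```

For SVS the equivalent counter lives inside Intel's library, so even this duplication could not be done on the VecSim side alone.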

Option C: Accept the race, disable SVS bounds check. preferAdHocSearch is a heuristic — a stale or slightly off value only affects the choice between ad-hoc and batch search, not the correctness of results. Intel has already merged a PR (intel/ScalableVectorSearch#301) that allows disabling the bounds check that causes the exception via SVS_EXPERIMENTAL_CHECK_BOUNDS=OFF. Once they release a version with this change, we will upgrade and compile with it disabled.

Decision: We're going with option C. Since preferAdHocSearch is purely a heuristic, the racy read is benign across all index types — the worst case is a suboptimal search strategy, never incorrect results. Adding locking or atomic counters adds complexity for no real correctness gain. We will remove the locks from preferAdHocSearch in this PR and keep the debugInfoIterator lock fix (which has no recursive lock issue and no perf concern).

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@meiravgri meiravgri enabled auto-merge March 25, 2026 15:23
@meiravgri meiravgri added this pull request to the merge queue Mar 25, 2026
Merged via the queue into main with commit 42fdc90 Mar 25, 2026
17 checks passed
@meiravgri meiravgri deleted the joan-fix-flaky-test branch March 25, 2026 16:39