MOD-14618 Fix race condition when calling indexLabelCount on SVS without locks #919

Merged: meiravgri merged 4 commits into main from joan-fix-flaky-test, Mar 25, 2026
Conversation


@JoanFM (Contributor) commented Mar 19, 2026

Fixes a race condition in VecSimTieredIndex::debugInfoIterator() that can crash when called while background indexing is running.

Root Cause

debugInfoIterator() accesses both sub-indexes without holding any locks. For SVS, debugInfoIterator() internally calls indexLabelCount(), which reads internal translation tables. These tables can be concurrently modified by background threads, causing a crash.

VecSimTieredIndex::debugInfoIterator()
  → backendIndex->debugInfoIterator()      ← NO LOCK HELD
    → SVSIndex::debugInfoIterator()
      → SVSIndex::debugInfo()
        → SVSIndex::indexLabelCount()      ← reads translation tables
          → CRASH: tables appear inconsistent due to concurrent modification

Fix

Hold the appropriate shared lock when accessing each sub-index's debug info:

  • Hold flatIndexGuard when calling frontendIndex->debugInfoIterator()
  • Hold mainIndexGuard when calling backendIndex->debugInfoIterator()

Each lock is released as soon as the respective iterator is created, minimizing contention.
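The locking pattern can be sketched as follows. This is a minimal illustration with hypothetical stand-in types, not the actual VecSimTieredIndex code; the real member names, iterator types, and signatures differ:

```cpp
#include <memory>
#include <shared_mutex>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the real sub-index and debug-info iterator types.
struct DebugInfoIterator {
    std::vector<int> fields;
};

struct SubIndex {
    std::vector<int> data;
    // Safe only while the caller holds the corresponding guard (shared).
    std::unique_ptr<DebugInfoIterator> debugInfoIterator() const {
        return std::make_unique<DebugInfoIterator>(DebugInfoIterator{data});
    }
};

struct TieredIndex {
    SubIndex frontendIndex, backendIndex;
    mutable std::shared_mutex flatIndexGuard, mainIndexGuard;

    std::pair<std::unique_ptr<DebugInfoIterator>, std::unique_ptr<DebugInfoIterator>>
    debugInfoIterators() const {
        std::unique_ptr<DebugInfoIterator> front, back;
        {
            // Hold flatIndexGuard (shared) only while the frontend iterator is built.
            std::shared_lock<std::shared_mutex> lock(flatIndexGuard);
            front = frontendIndex.debugInfoIterator();
        } // flatIndexGuard released here, before touching the backend.
        {
            // Hold mainIndexGuard (shared) only while the backend iterator is built.
            std::shared_lock<std::shared_mutex> lock(mainIndexGuard);
            back = backendIndex.debugInfoIterator();
        }
        return {std::move(front), std::move(back)};
    }
};
```

Scoping each lock to its own block mirrors the PR's intent: background writers are blocked only for the brief window in which each iterator snapshot is created.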

Follow-up: Once Intel releases a new version of SVS that includes intel/ScalableVectorSearch#301, we need a follow-up PR to upgrade SVS and compile with SVS_EXPERIMENTAL_CHECK_BOUNDS=OFF.
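For illustration only, the follow-up build step might look like the following; it assumes the option keeps the exact name SVS_EXPERIMENTAL_CHECK_BOUNDS and is exposed as a CMake cache variable once the SVS release lands:

```shell
# Hypothetical follow-up build, once an SVS release includes
# intel/ScalableVectorSearch#301; flag name taken from that PR.
cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug -DSVS_EXPERIMENTAL_CHECK_BOUNDS=OFF
cmake --build build
```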

Mark if applicable

  • This PR introduces API changes
  • This PR introduces serialization changes


Note

Low Risk
Low risk concurrency fix that only adds shared-lock guards around frontendIndex/backendIndex debug-info access; main concern is potential for minor lock contention in debug/info paths.

Overview
Prevents a race when retrieving tiered index debug info during background indexing by acquiring flatIndexGuard and mainIndexGuard (shared) while calling frontendIndex->debugInfoIterator() and backendIndex->debugInfoIterator().

Locks are scoped independently per sub-index to keep contention minimal while ensuring thread-safe access to backend SVS debug paths that may touch mutable translation tables.

Written by Cursor Bugbot for commit 56d0b33.

@JoanFM changed the title from "proposal to fix flaky test" to "Fix race condition in VecSimTieredIndex::debugInfoIterator during background indexing" Mar 19, 2026

jit-ci bot commented Mar 19, 2026

🛡️ Jit Security Scan Results


✅ No security findings were detected in this PR


Security scan by Jit

@JoanFM JoanFM requested review from GuyAv46 and meiravgri and removed request for GuyAv46 March 19, 2026 16:33

codecov bot commented Mar 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.98%. Comparing base (3dd2dc2) to head (56d0b33).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #919   +/-   ##
=======================================
  Coverage   96.98%   96.98%           
=======================================
  Files         129      129           
  Lines        7567     7574    +7     
=======================================
+ Hits         7339     7346    +7     
  Misses        228      228           


@meiravgri changed the title from "Fix race condition in VecSimTieredIndex::debugInfoIterator during background indexing" to "Fix race condition when calling indexLabelCount on SVS without locks" Mar 23, 2026
@meiravgri changed the title from "Fix race condition when calling indexLabelCount on SVS without locks" to "MOD-14618 Fix race condition when calling indexLabelCount on SVS without locks" Mar 23, 2026
@meiravgri meiravgri requested a review from alonre24 March 23, 2026 10:47
@alonre24 (Collaborator) commented:

  1. Regarding protecting indexLabelCount when it is called from VecSimTieredIndex::debugInfoIterator: we should add the shared-lock protection; we should not worry about performance here.
  2. Regarding the heuristic: what is the suggested solution, if not acquiring the shared lock? Reading the index size instead and making it atomic?

@meiravgri @GuyAv46

@meiravgri (Collaborator) commented:

@alonre24 Reading the index size instead of the label count changes the heuristic, as it includes marked-deleted vectors.

  1. We can wrap only the label count call; not sure it makes much of a difference.
  2. The exception is thrown only when a specific compilation option is set:
    https://github.com/intel/ScalableVectorSearch/blob/aa272e270d198106a6b595632db3ad517ebbc773/include/svs/lib/boundscheck.h#L28
    This option is enabled when the build type is not Release. We can ask Intel to introduce an option to disable it from an external source; it only requires changes in the public repo. I already confirmed with them that the binaries are built with the option disabled.

@GuyAv46 (Collaborator) commented Mar 24, 2026

We can take this fix.
IMO, both the size and label count APIs should become safe. We can track both with an atomic variable that we couple to the map size every time we change it.

@alonre24 (Collaborator) commented:

@meiravgri - we can disable SVS_CHECK_BOUNDS when building from SVS source, but I'm worried about the unsafe access itself, not just the exception. So I guess we should protect both index size and label count, either by locking or by using atomic vars as @GuyAv46 mentioned (whichever is simpler to implement) - I don't think it should bring any regression.

@GuyAv46 (Collaborator) commented Mar 24, 2026

Don’t we already hold the locks when calling this API (through the batch iterator)?

@meiravgri (Collaborator) commented:

@alonre24 The label count is an internal value in SVS; we would need to ask them to change it if we want to take the atomic approach. For HNSW and flat we can change the counts to be atomic.
@GuyAv46 VecSimIndex_PreferAdHocSearch is an independent API and is called by RediSearch. RediSearch only acquires the locks for the ad-hoc BF iterator, which doesn't involve calling VecSimIndex_PreferAdHocSearch.

@GuyAv46 (Collaborator) commented Mar 24, 2026

@meiravgri we do call VecSimIndex_PreferAdHocSearch while we have a batch iterator, and the HNSW tiered batch iterator holds the main index guard throughout the execution, and releases it in its destructor (or if it's depleted), so the current solution is unsafe

@meiravgri (Collaborator) commented:

@GuyAv46 Right, I missed that, thanks!
So for that call we are already OK; we just need to convert the flat index size variable to atomic.
For the unguarded read we can introduce VecSimIndex_PreferAdHocSearch_Safe, which would lock. It is called once when creating the iterator, so it should be OK.
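As a rough sketch of the proposed wrapper (hypothetical types; the real VecSimIndex C API and guard layout differ, and this proposal was ultimately dropped in favor of accepting the benign race):

```cpp
#include <cstddef>
#include <shared_mutex>

// Hypothetical tiered index exposing the two guards and the raw heuristic.
struct TieredIndex {
    mutable std::shared_mutex flatIndexGuard, mainIndexGuard;
    std::size_t flatSize = 0, mainSize = 0;
    bool preferAdHocSearch(std::size_t subsetSize) const {
        // Stand-in for the real heuristic, which reads sizes/label counts.
        return subsetSize < flatSize + mainSize;
    }
};

// Proposed safe variant: take both guards (shared) around the raw call.
// It would run once at batch-iterator creation, so the cost is negligible.
bool VecSimIndex_PreferAdHocSearch_Safe(const TieredIndex &idx,
                                        std::size_t subsetSize) {
    std::shared_lock<std::shared_mutex> flatLock(idx.flatIndexGuard);
    std::shared_lock<std::shared_mutex> mainLock(idx.mainIndexGuard);
    return idx.preferAdHocSearch(subsetSize);
}
```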

@meiravgri (Collaborator) commented:

@alonre24 @GuyAv46

There are two call sites for preferAdHocSearch from RediSearch:

  1. Initial heuristic check: no locks held.
  2. Re-check during batch iteration: mainIndexGuard is already held shared by the batch iterator, but flatIndexGuard is not held.

Several options were considered for fixing the race in preferAdHocSearch:

Option A: Lock at the RediSearch call sites. Use the existing AcquireSharedLocks/ReleaseSharedLocks API to protect call site 1 (which has no locks), and accept the racy BF read in call site 2 (which already holds mainIndexGuard but not flatIndexGuard). Problem: this only solves half the issue — call site 2 still reads BF's indexSize/indexLabelCount without the flat lock.

Option B: Atomic shadow counters. Maintain std::atomic<size_t> for label count and index size in each index type, updated on every add/delete. preferAdHocSearch reads atomics instead of the actual data structures. Problem: this is invasive — for multi-value indexes (BF and HNSW), indexLabelCount() returns a map size (labelToIdsLookup.size() / labelLookup.size()). Shadowing this with an atomic counter means manually tracking when a label is truly added (first vector for that label) vs. when it's truly removed (last vector of that label), duplicating logic that already lives in the map operations. This is error-prone and adds a maintenance burden. For SVS, the label count lives inside Intel's library, so we'd need a separate shadow counter there too.
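To make Option B's cost concrete, here is a sketch (hypothetical names, not the real BF/HNSW code) of what a shadow counter would force a multi-value index to do; the first-vector/last-vector bookkeeping duplicates what the map already tracks:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical multi-value index: one label may own several vector ids.
struct MultiValueIndex {
    std::unordered_map<int, std::vector<int>> labelToIdsLookup;
    std::atomic<std::size_t> labelCount{0};  // shadow of labelToIdsLookup.size()

    void addVector(int label, int id) {
        auto &ids = labelToIdsLookup[label];
        if (ids.empty())              // only the FIRST vector of a label
            labelCount.fetch_add(1);  // actually grows the label count
        ids.push_back(id);
    }

    void removeVector(int label, int id) {
        auto it = labelToIdsLookup.find(label);
        if (it == labelToIdsLookup.end())
            return;
        auto &ids = it->second;
        ids.erase(std::remove(ids.begin(), ids.end(), id), ids.end());
        if (ids.empty()) {            // only the LAST vector of a label
            labelToIdsLookup.erase(it);
            labelCount.fetch_sub(1);  // actually shrinks the label count
        }
    }

    // Lock-free read; may briefly lag the map during an in-flight update.
    std::size_t indexLabelCount() const { return labelCount.load(); }
};
```

For SVS the equivalent counter lives inside Intel's library, so even this duplication could not be done on the VecSim side alone.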

Option C: Accept the race, disable SVS bounds check. preferAdHocSearch is a heuristic — a stale or slightly off value only affects the choice between ad-hoc and batch search, not the correctness of results. Intel has already merged a PR (intel/ScalableVectorSearch#301) that allows disabling the bounds check that causes the exception via SVS_EXPERIMENTAL_CHECK_BOUNDS=OFF. Once they release a version with this change, we will upgrade and compile with it disabled.

Decision: We're going with option C. Since preferAdHocSearch is purely a heuristic, the racy read is benign across all index types — the worst case is a suboptimal search strategy, never incorrect results. Adding locking or atomic counters adds complexity for no real correctness gain. We will remove the locks from preferAdHocSearch in this PR and keep the debugInfoIterator lock fix (which has no recursive lock issue and no perf concern).

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@meiravgri meiravgri enabled auto-merge March 25, 2026 15:23
@meiravgri meiravgri added this pull request to the merge queue Mar 25, 2026
Merged via the queue into main with commit 42fdc90 Mar 25, 2026
17 checks passed
@meiravgri meiravgri deleted the joan-fix-flaky-test branch March 25, 2026 16:39