fix: Dedupe LoginUserOperation at enqueue time#2617
fix: Dedupe LoginUserOperation at enqueue time#2617nan-li wants to merge 1 commit intofix/jwt_misc_race_conditionsfrom
Conversation
Prevents RecoverFromDroppedLoginBug and a concurrent OneSignal.login() call from both enqueuing a LoginUserOperation for the same user. The race opens because login() splits switchUser (sync) from enqueueLogin (async) — RecoverFromDroppedLoginBug can fire during the gap and see no pending login op, enqueuing its own recovery. When both ops ultimately land in the queue they get grouped, and the executor crashes on the unexpected second LoginUserOperation in the batch.
| ) { | ||
| Logging.debug("OperationRepo: internalEnqueue - LoginUserOperation for onesignalId: ${op.onesignalId} already exists in the queue.") | ||
| queueItem.waiter?.wake(true) | ||
| return | ||
| } | ||
|
|
||
| if (index != null) { | ||
| queue.add(index, queueItem) | ||
| } else { |
There was a problem hiding this comment.
🔴 The new dedupe block has two issues: (1) it calls waiter.wake(true) immediately on the deduped caller before the original operation completes — if that op ultimately fails (e.g. FAIL_PAUSE_OPREPO, FAIL_UNAUTHORIZED for anonymous user), the caller silently receives a false success; (2) when loadSavedOperations() loads two persisted LoginUserOperations with the same onesignalId (state left from a pre-PR race), the dedupe drops the second op without calling _operationModelStore.remove(queueItem.operation.id), leaving it orphaned in the store for the entire session until the next cold start.
Extended reasoning...
Bug 1 — False success signal on deduped waiter
When a LoginUserOperation is detected as a duplicate in internalEnqueue, the code immediately calls queueItem.waiter?.wake(true) and returns. This unblocks any enqueueAndWait caller (e.g. loginSuspend) with a success result before the already-queued operation has actually executed.
The specific failure paths that expose this: if the original queued op (op_A) returns FAIL_PAUSE_OPREPO (server 5xx, repo paused), the deduped caller already returned true from enqueueAndWait and the login state machine proceeds as if login succeeded — but the queue is now paused and all subsequent enqueued operations will never execute for this session. Similarly, FAIL_UNAUTHORIZED for an anonymous user (null externalId) causes op_A to be dropped with wake(false), while the deduped caller already got true.
The current code path that surfaces this: LoginHelper.enqueueLogin (the only direct caller of enqueueAndWait with a LoginUserOperation) checks if (!result) { Logging.warn("Could not login user") }. With the false success, this warning is never triggered even when login actually failed. Since loginSuspend returns Unit and callers don't inspect the boolean further today, the practical impact is limited to a missing log line — but the contract violation is real and future callers that act on the result would be silently misled.
The PR comment acknowledges the trade-off ("Wake the waiter as if we succeeded to avoid hanging forever"). A more precise fix would be to let op_A's actual result propagate to the deduped waiter by chaining it (e.g. storing deduped waiters on the existing queue item and waking them when op_A completes), though that requires more restructuring.
Bug 2 — Store leak when dedup fires during loadSavedOperations
loadSavedOperations() iterates persisted ops and calls internalEnqueue(..., addToStore=false). If the on-disk store already contains two LoginUserOperations with the same onesignalId (which the pre-PR race condition could produce), the dedupe check fires on the second op and returns early — before the if (addToStore) block, and crucially without ever calling _operationModelStore.remove().
Concrete step-by-step:
- Pre-PR code writes two LoginUserOperation(onesignalId="local-AAA") entries to the store during a race.
- App restarts with this PR installed. loadSavedOperations() loads them in reversed order.
- op_A is enqueued successfully into the in-memory queue.
- op_B hits the dedupe check — queue.any { ... onesignalId == "local-AAA" } is true. Returns early. _operationModelStore.remove(op_B.id) is never called.
- op_A executes, calls _operationModelStore.remove(op_A.id). op_B's ID remains in the store.
- Next cold start: op_B is the only LoginUserOperation in the store, so no dedup fires — it executes as an idempotent upsert and is finally removed.
The leak is self-healing after exactly one additional cold start, and the stale login op runs idempotently (upsert path). However, the orphaned entry persists for the entire current session.
Fix for bug 2: Inside the dedupe branch, add if (!addToStore) _operationModelStore.remove(queueItem.operation.id) before return. Since addToStore=false is exclusively used by loadSavedOperations, this correctly means 'this op came from the store; since we are dropping it from the queue, clean it from the store too.'
One Line Summary
Dedupe LoginUserOperation at enqueue time to prevent duplicate login ops from crashing the executor.
Details
Motivation
After #2600,
OneSignal.login()was split intoswitchUser()(synchronous) andenqueueLogin()(async viasuspendifyOnIO). This introduced a window between the two steps whereRecoverFromDroppedLoginBug.start()can fire and observe:externalIdis set (login was just called)onesignalIdis a local ID (user not yet created on backend)LoginUserOperationin the queue yet (enqueue is still pending)This matches the "dropped login bug" criteria, so it enqueues its own recovery
LoginUserOperation. Shortly after, the realenqueueLogin()adds another one. When both ops land in the queue they get grouped (samemodifyComparisonKey), andLoginUserOperationExecutor.createUser()throws "Unrecognized operation" on the secondLoginUserOperationin the batch — dropping both ops and logging "Could not login user". Before #2600,login()enqueued synchronously soRecoverFromDroppedLoginBugwould always see the op already queued and skip.Scope
OperationRepo.internalEnqueue()— added a second dedupe check: if aLoginUserOperationfor the sameonesignalIdis already queued, drop the incoming one.truesologinSuspend()callers don't hang onenqueueAndWait.waitForWake()forever.Testing
Manual testing
OneSignal.login()right afterinitWithContext→ user created on backend, no "Unrecognized operation" error in logs.[login-user, create-subscription, login-user]batch.Affected code checklist
Checklist
Overview
Testing
Final pass