Skip to content

fix: Dedupe LoginUserOperation at enqueue time#2617

Open
nan-li wants to merge 1 commit intofix/jwt_misc_race_conditionsfrom
fix/dedupe-login-operation
Open

fix: Dedupe LoginUserOperation at enqueue time#2617
nan-li wants to merge 1 commit intofix/jwt_misc_race_conditionsfrom
fix/dedupe-login-operation

Conversation

@nan-li
Copy link
Copy Markdown
Contributor

@nan-li nan-li commented Apr 17, 2026

One Line Summary

Dedupe LoginUserOperation at enqueue time to prevent duplicate login ops from crashing the executor.

Details

Motivation

After #2600, OneSignal.login() was split into switchUser() (synchronous) and enqueueLogin() (async via suspendifyOnIO). This introduced a window between the two steps where RecoverFromDroppedLoginBug.start() can fire and observe:

  • externalId is set (login was just called)
  • onesignalId is a local ID (user not yet created on backend)
  • No LoginUserOperation in the queue yet (enqueue is still pending)

This matches the "dropped login bug" criteria, so it enqueues its own recovery LoginUserOperation. Shortly after, the real enqueueLogin() adds another one. When both ops land in the queue they get grouped (same modifyComparisonKey), and LoginUserOperationExecutor.createUser() throws "Unrecognized operation" on the second LoginUserOperation in the batch — dropping both ops and logging "Could not login user". Before #2600, login() enqueued synchronously so RecoverFromDroppedLoginBug would always see the op already queued and skip.

Scope

  • OperationRepo.internalEnqueue() — added a second dedupe check: if a LoginUserOperation for the same onesignalId is already queued, drop the incoming one.
  • The dedupe path wakes the dropped op's waiter with true so loginSuspend() callers don't hang on enqueueAndWait.waitForWake() forever.
  • Not IV/JWT-specific — this race affects any login flow.

Testing

Manual testing

  • Fresh install with cached anonymous user: call OneSignal.login() right after initWithContext → user created on backend, no "Unrecognized operation" error in logs.
  • Verified the executor no longer receives a grouped [login-user, create-subscription, login-user] batch.

Affected code checklist

  • REST API requests

Checklist

Overview

  • I have filled out all REQUIRED sections above
  • PR does one thing
  • Any Public API changes are explained in the PR details and conform to existing APIs

Testing

  • I have included test coverage for these changes, or explained why they are not needed
  • All automated tests pass, or I explained why that is not possible
  • I have personally tested this on my device, or explained why that is not possible

Final pass

  • Code is as readable as possible.
  • I have reviewed this PR myself, ensuring it meets each checklist item

Prevents RecoverFromDroppedLoginBug and a concurrent OneSignal.login()
call from both enqueuing a LoginUserOperation for the same user. The
race opens because login() splits switchUser (sync) from enqueueLogin
(async) — RecoverFromDroppedLoginBug can fire during the gap and see
no pending login op, enqueuing its own recovery. When both ops
ultimately land in the queue they get grouped, and the executor
crashes on the unexpected second LoginUserOperation in the batch.
@nan-li nan-li changed the title Dedupe LoginUserOperation at enqueue time fix: Dedupe LoginUserOperation at enqueue time Apr 17, 2026
@nan-li nan-li requested a review from abdulraqeeb33 April 17, 2026 15:46
Comment on lines +183 to 191
) {
Logging.debug("OperationRepo: internalEnqueue - LoginUserOperation for onesignalId: ${op.onesignalId} already exists in the queue.")
queueItem.waiter?.wake(true)
return
}

if (index != null) {
queue.add(index, queueItem)
} else {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The new dedupe block has two issues: (1) it calls waiter.wake(true) immediately on the deduped caller before the original operation completes — if that op ultimately fails (e.g. FAIL_PAUSE_OPREPO, FAIL_UNAUTHORIZED for anonymous user), the caller silently receives a false success; (2) when loadSavedOperations() loads two persisted LoginUserOperations with the same onesignalId (state left from a pre-PR race), the dedupe drops the second op without calling _operationModelStore.remove(queueItem.operation.id), leaving it orphaned in the store for the entire session until the next cold start.

Extended reasoning...

Bug 1 — False success signal on deduped waiter

When a LoginUserOperation is detected as a duplicate in internalEnqueue, the code immediately calls queueItem.waiter?.wake(true) and returns. This unblocks any enqueueAndWait caller (e.g. loginSuspend) with a success result before the already-queued operation has actually executed.

The specific failure paths that expose this: if the original queued op (op_A) returns FAIL_PAUSE_OPREPO (server 5xx, repo paused), the deduped caller already returned true from enqueueAndWait and the login state machine proceeds as if login succeeded — but the queue is now paused and all subsequent enqueued operations will never execute for this session. Similarly, FAIL_UNAUTHORIZED for an anonymous user (null externalId) causes op_A to be dropped with wake(false), while the deduped caller already got true.

The current code path that surfaces this: LoginHelper.enqueueLogin (the only direct caller of enqueueAndWait with a LoginUserOperation) checks if (!result) { Logging.warn("Could not login user") }. With the false success, this warning is never triggered even when login actually failed. Since loginSuspend returns Unit and callers don't inspect the boolean further today, the practical impact is limited to a missing log line — but the contract violation is real and future callers that act on the result would be silently misled.

The PR comment acknowledges the trade-off ("Wake the waiter as if we succeeded to avoid hanging forever"). A more precise fix would be to let op_A's actual result propagate to the deduped waiter by chaining it (e.g. storing deduped waiters on the existing queue item and waking them when op_A completes), though that requires more restructuring.

Bug 2 — Store leak when dedup fires during loadSavedOperations

loadSavedOperations() iterates persisted ops and calls internalEnqueue(..., addToStore=false). If the on-disk store already contains two LoginUserOperations with the same onesignalId (which the pre-PR race condition could produce), the dedupe check fires on the second op and returns early — before the if (addToStore) block, and crucially without ever calling _operationModelStore.remove().

Concrete step-by-step:

  1. Pre-PR code writes two LoginUserOperation(onesignalId="local-AAA") entries to the store during a race.
  2. App restarts with this PR installed. loadSavedOperations() loads them in reversed order.
  3. op_A is enqueued successfully into the in-memory queue.
  4. op_B hits the dedupe check — queue.any { ... onesignalId == "local-AAA" } is true. Returns early. _operationModelStore.remove(op_B.id) is never called.
  5. op_A executes, calls _operationModelStore.remove(op_A.id). op_B's ID remains in the store.
  6. Next cold start: op_B is the only LoginUserOperation in the store, so no dedup fires — it executes as an idempotent upsert and is finally removed.

The leak is self-healing after exactly one additional cold start, and the stale login op runs idempotently (upsert path). However, the orphaned entry persists for the entire current session.

Fix for bug 2: Inside the dedupe branch, add if (!addToStore) _operationModelStore.remove(queueItem.operation.id) before return. Since addToStore=false is exclusively used by loadSavedOperations, this correctly means 'this op came from the store; since we are dropping it from the queue, clean it from the store too.'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant