fix: Dedupe LoginUserOperation at enqueue time by nan-li · Pull Request #2617 · OneSignal/OneSignal-Android-SDK

nan-li · 2026-04-17T15:40:29Z

One Line Summary

Dedupe LoginUserOperation at enqueue time to prevent duplicate login ops from crashing the executor.

Details

Motivation

After #2600, OneSignal.login() was split into switchUser() (synchronous) and enqueueLogin() (async via suspendifyOnIO). This introduced a window between the two steps where RecoverFromDroppedLoginBug.start() can fire and observe:

externalId is set (login was just called)
onesignalId is a local ID (user not yet created on backend)
No LoginUserOperation in the queue yet (enqueue is still pending)

This matches the "dropped login bug" criteria, so it enqueues its own recovery LoginUserOperation. Shortly after, the real enqueueLogin() adds another one. When both ops land in the queue they get grouped (same modifyComparisonKey), and LoginUserOperationExecutor.createUser() throws "Unrecognized operation" on the second LoginUserOperation in the batch — dropping both ops and logging "Could not login user". Before #2600, login() enqueued synchronously so RecoverFromDroppedLoginBug would always see the op already queued and skip.

Scope

OperationRepo.internalEnqueue() — added a second dedupe check: if a LoginUserOperation for the same onesignalId is already queued, drop the incoming one.
The dedupe path wakes the dropped op's waiter with true so loginSuspend() callers don't hang on enqueueAndWait.waitForWake() forever.
Not IV/JWT-specific — this race affects any login flow.

Testing

Manual testing

Fresh install with cached anonymous user: call OneSignal.login() right after initWithContext → user created on backend, no "Unrecognized operation" error in logs.
Verified the executor no longer receives a grouped [login-user, create-subscription, login-user] batch.

Affected code checklist

REST API requests

Checklist

Overview

I have filled out all REQUIRED sections above
PR does one thing
Any Public API changes are explained in the PR details and conform to existing APIs

Testing

I have included test coverage for these changes, or explained why they are not needed
All automated tests pass, or I explained why that is not possible
I have personally tested this on my device, or explained why that is not possible

Final pass

Code is as readable as possible.
I have reviewed this PR myself, ensuring it meets each checklist item

Prevents RecoverFromDroppedLoginBug and a concurrent OneSignal.login() call from both enqueuing a LoginUserOperation for the same user. The race opens because login() splits switchUser (sync) from enqueueLogin (async) — RecoverFromDroppedLoginBug can fire during the gap and see no pending login op, enqueuing its own recovery. When both ops ultimately land in the queue they get grouped, and the executor crashes on the unexpected second LoginUserOperation in the batch.

claude · 2026-04-17T15:56:59Z

+            ) {
+                Logging.debug("OperationRepo: internalEnqueue - LoginUserOperation for onesignalId: ${op.onesignalId} already exists in the queue.")
+                queueItem.waiter?.wake(true)
+                return
+            }
+
            if (index != null) {
                queue.add(index, queueItem)
            } else {


🔴 The new dedupe block has two issues: (1) it calls waiter.wake(true) immediately on the deduped caller before the original operation completes — if that op ultimately fails (e.g. FAIL_PAUSE_OPREPO, FAIL_UNAUTHORIZED for anonymous user), the caller silently receives a false success; (2) when loadSavedOperations() loads two persisted LoginUserOperations with the same onesignalId (state left from a pre-PR race), the dedupe drops the second op without calling _operationModelStore.remove(queueItem.operation.id), leaving it orphaned in the store for the entire session until the next cold start.

Extended reasoning...

Bug 1 — False success signal on deduped waiter

When a LoginUserOperation is detected as a duplicate in internalEnqueue, the code immediately calls queueItem.waiter?.wake(true) and returns. This unblocks any enqueueAndWait caller (e.g. loginSuspend) with a success result before the already-queued operation has actually executed.

The specific failure paths that expose this: if the original queued op (op_A) returns FAIL_PAUSE_OPREPO (server 5xx, repo paused), the deduped caller already returned true from enqueueAndWait and the login state machine proceeds as if login succeeded — but the queue is now paused and all subsequent enqueued operations will never execute for this session. Similarly, FAIL_UNAUTHORIZED for an anonymous user (null externalId) causes op_A to be dropped with wake(false), while the deduped caller already got true.

The current code path that surfaces this: LoginHelper.enqueueLogin (the only direct caller of enqueueAndWait with a LoginUserOperation) checks if (!result) { Logging.warn("Could not login user") }. With the false success, this warning is never triggered even when login actually failed. Since loginSuspend returns Unit and callers don't inspect the boolean further today, the practical impact is limited to a missing log line — but the contract violation is real and future callers that act on the result would be silently misled.

The PR comment acknowledges the trade-off ("Wake the waiter as if we succeeded to avoid hanging forever"). A more precise fix would be to let op_A's actual result propagate to the deduped waiter by chaining it (e.g. storing deduped waiters on the existing queue item and waking them when op_A completes), though that requires more restructuring.

Bug 2 — Store leak when dedup fires during loadSavedOperations

loadSavedOperations() iterates persisted ops and calls internalEnqueue(..., addToStore=false). If the on-disk store already contains two LoginUserOperations with the same onesignalId (which the pre-PR race condition could produce), the dedupe check fires on the second op and returns early — before the if (addToStore) block, and crucially without ever calling _operationModelStore.remove().

Concrete step-by-step:

Pre-PR code writes two LoginUserOperation(onesignalId="local-AAA") entries to the store during a race.

App restarts with this PR installed. loadSavedOperations() loads them in reversed order.

op_A is enqueued successfully into the in-memory queue.

op_B hits the dedupe check — queue.any { ... onesignalId == "local-AAA" } is true. Returns early. _operationModelStore.remove(op_B.id) is never called.

op_A executes, calls _operationModelStore.remove(op_A.id). op_B's ID remains in the store.

Next cold start: op_B is the only LoginUserOperation in the store, so no dedup fires — it executes as an idempotent upsert and is finally removed.

The leak is self-healing after exactly one additional cold start, and the stale login op runs idempotently (upsert path). However, the orphaned entry persists for the entire current session.

Fix for bug 2: Inside the dedupe branch, add if (!addToStore) _operationModelStore.remove(queueItem.operation.id) before return. Since addToStore=false is exclusively used by loadSavedOperations, this correctly means 'this op came from the store; since we are dropping it from the queue, clean it from the store too.'

nan-li changed the title ~~Dedupe LoginUserOperation at enqueue time~~ fix: Dedupe LoginUserOperation at enqueue time Apr 17, 2026

nan-li requested a review from abdulraqeeb33 April 17, 2026 15:46

claude bot reviewed Apr 17, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Dedupe LoginUserOperation at enqueue time#2617

fix: Dedupe LoginUserOperation at enqueue time#2617
nan-li wants to merge 1 commit intofix/jwt_misc_race_conditionsfrom
fix/dedupe-login-operation

nan-li commented Apr 17, 2026 •

edited

Loading

Uh oh!

claude bot Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nan-li commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

One Line Summary

Details

Motivation

Scope

Testing

Manual testing

Affected code checklist

Checklist

Overview

Testing

Final pass

Uh oh!

claude bot Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

nan-li commented Apr 17, 2026 •

edited

Loading