fix: login race causes subsequent calls to target previous user#2618
fix: login race causes subsequent calls to target previous user#2618
Conversation
OneSignal.login() scheduled the user switch asynchronously via suspendifyOnIO, so the API call returned before the identity model was updated. Calls made right after login (e.g. addTag) applied to the OLD user because the switch hadn't happened yet. Split login into two phases: - switchUser(): synchronous under the login/logout lock. Swaps the local identity/properties/subscription models immediately so subsequent SDK calls see the new user. - enqueueLogin(): async. Enqueues the LoginUserOperation and awaits completion of the backend user creation. Because switchUser returns before enqueueLogin runs, there's a small window where RecoverFromDroppedLoginBug can observe the state (new externalId, local onesignalId, no LoginUserOperation in queue) and enqueue its own recovery login. When the real enqueueLogin then adds another login op, the executor crashes on a grouped batch containing two LoginUserOperations. Add a dedupe check in internalEnqueue: if a LoginUserOperation for the same onesignalId is already queued, drop the incoming one and wake its waiter (so loginSuspend callers don't hang).
| // Dedupe LoginUserOperation for the same user — prevents RecoverFromDroppedLoginBug | ||
| // and the real login() call from both enqueuing a login op during the timing window. | ||
| // Wake the waiter as if we succeeded, since the already-queued op will do the work | ||
| // and enqueueAndWait callers (e.g. loginSuspend) would otherwise hang forever. | ||
| val op = queueItem.operation | ||
| if (op is LoginUserOperation && | ||
| queue.any { it.operation is LoginUserOperation && it.operation.onesignalId == op.onesignalId } | ||
| ) { | ||
| Logging.debug("OperationRepo: internalEnqueue - LoginUserOperation for onesignalId: ${op.onesignalId} already exists in the queue.") | ||
| queueItem.waiter?.wake(true) | ||
| return | ||
| } | ||
|
|
There was a problem hiding this comment.
🔴 The new dedup check (OperationRepo.kt lines 170-177) unconditionally keeps whichever LoginUserOperation was enqueued first, but in the anonymous-user-conversion race the first-enqueued op (from RecoverFromDroppedLoginBug) always has existingOnesignalId=null while the second (from enqueueLogin) carries the real anon UUID needed to merge the user's data. The dedup silently drops the conversion-carrying op and wakes its waiter with true, so the backend only receives the null-existingOnesignalId op and never performs the merge; the anonymous user's tags and subscriptions are permanently lost. Fix by replacing the existing queued op with the incoming one when the incoming op has a non-null existingOnesignalId, or by merging existingOnesignalId into the already-queued item before discarding the duplicate.
Extended reasoning...
What the bug is and how it manifests
The dedup block added at OperationRepo.kt lines 170-177 drops any LoginUserOperation whose onesignalId already appears in the queue and wakes the waiter with true. The intent is to prevent RecoverFromDroppedLoginBug and the real enqueueLogin() from both queuing an op for the same user. However, the check is purely identity-based: it does not compare the information content of the two ops. When RecoverFromDroppedLoginBug enqueues first with existingOnesignalId=null and the real login enqueues second with existingOnesignalId='<anon-uuid>', the second op—which carries the anonymous-user conversion link—is silently discarded.
The specific code path that triggers it
- An anonymous user (externalId=null, onesignalId='old-anon-uuid') calls
login('user1'). LoginHelper.switchUser('user1')runs synchronously under the loginLogoutLock: it creates a new local user (onesignalId='new-local-uuid', externalId='user1') and returnsLoginEnqueueContext(existingOnesignalId='old-anon-uuid').- At this point externalId='user1' is visible in the identity model but no LoginUserOperation has been enqueued yet.
RecoverFromDroppedLoginBug.start()runs in a suspendifyOnIO coroutine that awaitsOperationRepo.awaitInitialized(). Once the queue is loaded from disk,isInBadState()observes: externalId non-null, local onesignalId, zero LoginUserOperations in queue → bad state. It callsrecoverByAddingBackDroppedLoginOperation()which hardcodesexistingOnesignalId=null(RecoverFromDroppedLoginBug.kt line ~84) and enqueuesLoginUserOperation(onesignalId='new-local-uuid', existingOnesignalId=null)viascope.launch.enqueueLogin(context)also callsoperationRepo.enqueueAndWait()viascope.launch, enqueuingLoginUserOperation(onesignalId='new-local-uuid', existingOnesignalId='old-anon-uuid').- Both launches execute on the single-threaded OSOperationRepoScope. If step 4's launch is dispatched before step 5's, the dedup check sees the queue already contains a LoginUserOperation for 'new-local-uuid' and drops step 5's op, waking its waiter with
true.
Why existing code doesn't prevent it
The comment in the dedup block explicitly acknowledges the RecoverFromDroppedLoginBug interaction but assumes the two ops are functionally equivalent. They are not: recoverByAddingBackDroppedLoginOperation() always passes null for existingOnesignalId because it has no stored reference to the anonymous-user's UUID at recovery time. The dedup block has no preference logic—first-in wins unconditionally.
Impact
Anonymous user conversion is silently broken. Tags, subscriptions, and any other data accumulated under the anonymous onesignalId are never merged into the identified user on the backend. The enqueueAndWait caller receives true (success), so no error is surfaced to the app. This is a data-loss regression introduced by this PR.
How to fix it
When the incoming LoginUserOperation has a non-null existingOnesignalId and the already-queued op has null, the correct behavior is to either (a) replace the queued op's existingOnesignalId in place (updating the persisted model too) and then discard the duplicate, or (b) remove the already-queued op and enqueue the incoming one instead. Option (a) is simpler:
if (op is LoginUserOperation) {
val existing = queue.firstOrNull {
it.operation is LoginUserOperation && it.operation.onesignalId == op.onesignalId
}
if (existing \!= null) {
// Prefer the op with conversion info if the queued one lacks it.
if (op.existingOnesignalId \!= null &&
(existing.operation as LoginUserOperation).existingOnesignalId == null
) {
(existing.operation as LoginUserOperation).existingOnesignalId = op.existingOnesignalId
// also update the persisted model
}
queueItem.waiter?.wake(true)
return
}
}Step-by-step proof
- T0: Anonymous user A (onesignalId='anon-1') exists. App calls
login('bob'). - T1:
switchUser('bob')runs synchronously: new user B created (onesignalId='local-2', externalId='bob'). Returns context(existingOnesignalId='anon-1'). Identity model now shows externalId='bob'. - T2: SDK init completes → awaitInitialized() resolves in RecoverFromDroppedLoginBug's coroutine.
- T3:
isInBadState(): externalId='bob', local onesignalId='local-2', queue is empty → true. CallsrecoverByAddingBackDroppedLoginOperation(existingOnesignalId=null).scope.launch { internalEnqueue(LoginUserOp('local-2', null)) }is posted to OSOperationRepoScope. - T4:
enqueueLogin(context)callsenqueueAndWait.scope.launch { internalEnqueue(LoginUserOp('local-2', 'anon-1')) }is posted to OSOperationRepoScope. - T5 (on OSOperationRepoScope): T3's launch runs first. Queue now contains LoginUserOp('local-2', existingId=null).
- T6 (on OSOperationRepoScope): T4's launch runs. Dedup: queue already has LoginUserOp for 'local-2'. Drops T4's op, wakes waiter with true.
- Result: Only LoginUserOp('local-2', existingId=null) executes. Backend creates/identifies user 'bob' but does NOT merge 'anon-1'. User A's tags and subscriptions are lost.
Description
One Line Summary
REQUIRED - Very short description that summaries the changes in this PR.
Details
Motivation
REQUIRED - Why is this code change being made? Or what is the goal of this PR? Examples: Fixes a specific bug, provides additional logging to debug future issues, feature to allow X.
Scope
RECOMMEND - OPTIONAL - What is intended to be effected. What is known not to change. Example: Notifications are grouped when parameter X is set, not enabled by default.
OPTIONAL - Other
OPTIONAL - Feel free to add any other sections or sub-sections that can explain your PR better.
Testing
Unit testing
OPTIONAL - Explain unit tests added, if not clear in the code.
Manual testing
RECOMMEND - OPTIONAL - Explain what scenarios were tested and the environment.
Example: Tested opening a notification while the app was foregrounded, app build with Android Studio 2020.3 with a fresh install of the OneSignal example app on a Pixel 6 with Android 12.
Affected code checklist
Checklist
Overview
Testing
Final pass