Skip to content

fix: login race causes subsequent calls to target previous user#2618

Open
nan-li wants to merge 1 commit intomainfrom
fix/login-race-condition-user-data-mixed
Open

fix: login race causes subsequent calls to target previous user#2618
nan-li wants to merge 1 commit intomainfrom
fix/login-race-condition-user-data-mixed

Conversation

@nan-li
Copy link
Copy Markdown
Contributor

@nan-li nan-li commented Apr 17, 2026

Description

One Line Summary

REQUIRED - Very short description that summaries the changes in this PR.

Details

Motivation

REQUIRED - Why is this code change being made? Or what is the goal of this PR? Examples: Fixes a specific bug, provides additional logging to debug future issues, feature to allow X.

Scope

RECOMMEND - OPTIONAL - What is intended to be effected. What is known not to change. Example: Notifications are grouped when parameter X is set, not enabled by default.

OPTIONAL - Other

OPTIONAL - Feel free to add any other sections or sub-sections that can explain your PR better.

Testing

Unit testing

OPTIONAL - Explain unit tests added, if not clear in the code.

Manual testing

RECOMMEND - OPTIONAL - Explain what scenarios were tested and the environment.
Example: Tested opening a notification while the app was foregrounded, app build with Android Studio 2020.3 with a fresh install of the OneSignal example app on a Pixel 6 with Android 12.

Affected code checklist

  • Notifications
    • Display
    • Open
    • Push Processing
    • Confirm Deliveries
  • Outcomes
  • Sessions
  • In-App Messaging
  • REST API requests
  • Public API changes

Checklist

Overview

  • I have filled out all REQUIRED sections above
  • PR does one thing
    • If it is hard to explain how any codes changes are related to each other then it most likely needs to be more than one PR
  • Any Public API changes are explained in the PR details and conform to existing APIs

Testing

  • I have included test coverage for these changes, or explained why they are not needed
  • All automated tests pass, or I explained why that is not possible
  • I have personally tested this on my device, or explained why that is not possible

Final pass

  • Code is as readable as possible.
    • Simplify with less code, followed by splitting up code into well named functions and variables, followed by adding comments to the code.
  • I have reviewed this PR myself, ensuring it meets each checklist item
    • WIP (Work In Progress) is ok, but explain what is still in progress and what you would like feedback on. Start the PR title with "WIP" to indicate this.

OneSignal.login() scheduled the user switch asynchronously via
suspendifyOnIO, so the API call returned before the identity model
was updated. Calls made right after login (e.g. addTag) applied to
the OLD user because the switch hadn't happened yet.

Split login into two phases:
- switchUser(): synchronous under the login/logout lock. Swaps the
  local identity/properties/subscription models immediately so
  subsequent SDK calls see the new user.
- enqueueLogin(): async. Enqueues the LoginUserOperation and awaits
  completion of the backend user creation.

Because switchUser returns before enqueueLogin runs, there's a small
window where RecoverFromDroppedLoginBug can observe the state (new
externalId, local onesignalId, no LoginUserOperation in queue) and
enqueue its own recovery login. When the real enqueueLogin then adds
another login op, the executor crashes on a grouped batch containing
two LoginUserOperations. Add a dedupe check in internalEnqueue: if a
LoginUserOperation for the same onesignalId is already queued, drop
the incoming one and wake its waiter (so loginSuspend callers don't
hang).
Comment on lines +165 to +177
// Dedupe LoginUserOperation for the same user — prevents RecoverFromDroppedLoginBug
// and the real login() call from both enqueuing a login op during the timing window.
// Wake the waiter as if we succeeded, since the already-queued op will do the work
// and enqueueAndWait callers (e.g. loginSuspend) would otherwise hang forever.
val op = queueItem.operation
if (op is LoginUserOperation &&
queue.any { it.operation is LoginUserOperation && it.operation.onesignalId == op.onesignalId }
) {
Logging.debug("OperationRepo: internalEnqueue - LoginUserOperation for onesignalId: ${op.onesignalId} already exists in the queue.")
queueItem.waiter?.wake(true)
return
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The new dedup check (OperationRepo.kt lines 170-177) unconditionally keeps whichever LoginUserOperation was enqueued first, but in the anonymous-user-conversion race the first-enqueued op (from RecoverFromDroppedLoginBug) always has existingOnesignalId=null while the second (from enqueueLogin) carries the real anon UUID needed to merge the user's data. The dedup silently drops the conversion-carrying op and wakes its waiter with true, so the backend only receives the null-existingOnesignalId op and never performs the merge; the anonymous user's tags and subscriptions are permanently lost. Fix by replacing the existing queued op with the incoming one when the incoming op has a non-null existingOnesignalId, or by merging existingOnesignalId into the already-queued item before discarding the duplicate.

Extended reasoning...

What the bug is and how it manifests

The dedup block added at OperationRepo.kt lines 170-177 drops any LoginUserOperation whose onesignalId already appears in the queue and wakes the waiter with true. The intent is to prevent RecoverFromDroppedLoginBug and the real enqueueLogin() from both queuing an op for the same user. However, the check is purely identity-based: it does not compare the information content of the two ops. When RecoverFromDroppedLoginBug enqueues first with existingOnesignalId=null and the real login enqueues second with existingOnesignalId='<anon-uuid>', the second op—which carries the anonymous-user conversion link—is silently discarded.

The specific code path that triggers it

  1. An anonymous user (externalId=null, onesignalId='old-anon-uuid') calls login('user1').
  2. LoginHelper.switchUser('user1') runs synchronously under the loginLogoutLock: it creates a new local user (onesignalId='new-local-uuid', externalId='user1') and returns LoginEnqueueContext(existingOnesignalId='old-anon-uuid').
  3. At this point externalId='user1' is visible in the identity model but no LoginUserOperation has been enqueued yet.
  4. RecoverFromDroppedLoginBug.start() runs in a suspendifyOnIO coroutine that awaits OperationRepo.awaitInitialized(). Once the queue is loaded from disk, isInBadState() observes: externalId non-null, local onesignalId, zero LoginUserOperations in queue → bad state. It calls recoverByAddingBackDroppedLoginOperation() which hardcodes existingOnesignalId=null (RecoverFromDroppedLoginBug.kt line ~84) and enqueues LoginUserOperation(onesignalId='new-local-uuid', existingOnesignalId=null) via scope.launch.
  5. enqueueLogin(context) also calls operationRepo.enqueueAndWait() via scope.launch, enqueuing LoginUserOperation(onesignalId='new-local-uuid', existingOnesignalId='old-anon-uuid').
  6. Both launches execute on the single-threaded OSOperationRepoScope. If step 4's launch is dispatched before step 5's, the dedup check sees the queue already contains a LoginUserOperation for 'new-local-uuid' and drops step 5's op, waking its waiter with true.

Why existing code doesn't prevent it

The comment in the dedup block explicitly acknowledges the RecoverFromDroppedLoginBug interaction but assumes the two ops are functionally equivalent. They are not: recoverByAddingBackDroppedLoginOperation() always passes null for existingOnesignalId because it has no stored reference to the anonymous-user's UUID at recovery time. The dedup block has no preference logic—first-in wins unconditionally.

Impact

Anonymous user conversion is silently broken. Tags, subscriptions, and any other data accumulated under the anonymous onesignalId are never merged into the identified user on the backend. The enqueueAndWait caller receives true (success), so no error is surfaced to the app. This is a data-loss regression introduced by this PR.

How to fix it

When the incoming LoginUserOperation has a non-null existingOnesignalId and the already-queued op has null, the correct behavior is to either (a) replace the queued op's existingOnesignalId in place (updating the persisted model too) and then discard the duplicate, or (b) remove the already-queued op and enqueue the incoming one instead. Option (a) is simpler:

if (op is LoginUserOperation) {
    val existing = queue.firstOrNull {
        it.operation is LoginUserOperation && it.operation.onesignalId == op.onesignalId
    }
    if (existing \!= null) {
        // Prefer the op with conversion info if the queued one lacks it.
        if (op.existingOnesignalId \!= null &&
            (existing.operation as LoginUserOperation).existingOnesignalId == null
        ) {
            (existing.operation as LoginUserOperation).existingOnesignalId = op.existingOnesignalId
            // also update the persisted model
        }
        queueItem.waiter?.wake(true)
        return
    }
}

Step-by-step proof

  • T0: Anonymous user A (onesignalId='anon-1') exists. App calls login('bob').
  • T1: switchUser('bob') runs synchronously: new user B created (onesignalId='local-2', externalId='bob'). Returns context(existingOnesignalId='anon-1'). Identity model now shows externalId='bob'.
  • T2: SDK init completes → awaitInitialized() resolves in RecoverFromDroppedLoginBug's coroutine.
  • T3: isInBadState(): externalId='bob', local onesignalId='local-2', queue is empty → true. Calls recoverByAddingBackDroppedLoginOperation(existingOnesignalId=null). scope.launch { internalEnqueue(LoginUserOp('local-2', null)) } is posted to OSOperationRepoScope.
  • T4: enqueueLogin(context) calls enqueueAndWait. scope.launch { internalEnqueue(LoginUserOp('local-2', 'anon-1')) } is posted to OSOperationRepoScope.
  • T5 (on OSOperationRepoScope): T3's launch runs first. Queue now contains LoginUserOp('local-2', existingId=null).
  • T6 (on OSOperationRepoScope): T4's launch runs. Dedup: queue already has LoginUserOp for 'local-2'. Drops T4's op, wakes waiter with true.
  • Result: Only LoginUserOp('local-2', existingId=null) executes. Backend creates/identifies user 'bob' but does NOT merge 'anon-1'. User A's tags and subscriptions are lost.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant