fix(realtime): reconnect WebSocket and catch up on missed events after iOS Safari backgrounding#4590
fix(realtime): reconnect WebSocket and catch up on missed events after iOS Safari backgrounding#4590ekumanov wants to merge 2 commits intoflarum:2.xfrom
Conversation
bd5b0b2 to
3fde090
Compare
imorland
left a comment
There was a problem hiding this comment.
Thanks for the thorough write-up and the real-device verification — the visibility/pageshow reconnect logic is well-reasoned and addresses a real iOS Safari issue. I'd like to see this land, but the PR as it currently stands has one significant issue and a few smaller ones that should be addressed first, especially given we're in 2.0.0-rc territory.
Main concern: undisclosed second change
The PR description covers the forceReconnect + visibilitychange + pageshow logic clearly. But commit a2e2c8d ("neutralise pusher-js WebSocket lives kill-switch") adds a second, more invasive change — patchPusherTransportManagers — which walks the pusher-js object graph up to 10 levels deep, duck-types TransportManager instances, and mutates their internals to disable the transport-death budget. This isn't mentioned in the PR body at all.
Concerns with this part:
- Reaches into pusher-js private API.
reportDeath,isAlive,livesLeftare internal to pusher-js's transport strategy, not public API. A patch-level pusher-js upgrade that renames or restructures these will silently stop matching the duck-type detector, at which point the iOS reconnect regresses with no loud failure. - The kill-switch exists for a reason. pusher-js's
lives: 2budget lets the client stop hammering a genuinely broken transport (badwsHost, blocked by network middlebox, TLS failure). PinninglivesLeft = Infinitydisables that for every failure mode, not just iOS backgrounding. The justification "our app layer owns reconnect end to end" doesn't quite hold — the app layer only reconnects on visibility/pageshow events, not on genuine transport failures. - Indiscriminate walk.
patchPusherTransportManagerspatches anything that duck-types correctly, to depth 10. Low probability of collateral damage, but still a footgun. - No reproduction steps. The issue describes scenarios 1 and 2, neither of which exercises the
lives-budget path. If the "code 1006 unclean close on the same transport" behaviour the comment describes is real and reproducible, it deserves its own issue + repro + test. - RC risk. The visibility/pageshow logic is surgical and easy to reason about. The transport-manager patch changes pusher-js failure behaviour globally via runtime introspection — not something I'd want to ship on an RC without more bake time.
Suggestion: split this into two PRs. Land the forceReconnect + visibility/pageshow handlers for rc.2 (clear win, well-tested, scoped to the reported scenarios). Open a separate PR for the transport-manager patch with its own repro steps, ideally targeting post-GA. If the lives-budget issue is genuinely blocking real users, consider a less invasive fix first — e.g. constructing a new Pusher instance in forceReconnect() rather than reusing the existing one, which sidesteps the private-field mutation entirely.
Smaller notes on the described change (non-blocking)
- Listener cleanup. The
visibilitychangeandpageshowhandlers are added insideApplication.prototype.mountextension with noremoveEventListener. On a normal page load this is fine, but a secondmount()call (test environments, any future SPA re-bootstrap) will stack handlers. Worth a one-line comment noting the assumption. - Refresh scope.
app.discussions?.refresh?.()covers the index page, but users viewing a single discussion won't see new posts via this path until they navigate. That's an acceptable v1 scope, but worth documenting as a known limitation so a follow-up can broaden it to whicheverPaginatedListStateis currently visible. - Checklist — please tick "tests have been added, or are not appropriate here" if tests aren't appropriate, and confirm
composer teststatus (even though this is JS-only, the box is on the template).
Recommendation
- Split the PR; keep the visibility/pageshow reconnect here.
- Open a separate PR for the transport-manager patch with repro steps and justification.
- Once split, I'm happy to approve the reconnect half for
rc.2.
e022ca0 to
d5ff3f4
Compare
|
Thanks for the detailed review — all points are fair, and I agree with the split. I've force-pushed On the smaller notes:
For the transport-manager patch: you're right that Ready for another look when you have a moment. |
|
Phase-2 issue filed: #4597 — includes the cycle-2 repro, pusher-js |
…r iOS Safari backgrounding iOS Safari silently drops WebSocket connections when the tab is backgrounded or the device sleeps; no `close` event fires, so pusher-js's built-in recovery never triggers and realtime updates go missing until reload. Additionally, iOS bfcaches pages on app-switch, which restores via `pageshow` (persisted=true) and does NOT fire `visibilitychange` on return — so handlers bound only to visibilitychange miss the common app-switch path. Add a `forceReconnect()` that hooks both `visibilitychange` (after >5s hidden) and `pageshow` (bfcache). It disconnects, waits 100ms to avoid a connecting- zombie race in pusher-js, then reconnects. After the new connection reports `'connected'` (i.e. after pusher-js's `subscribeAll()` has re-subscribed the channels on the fresh socket), it calls `app.discussions.refresh()` so the visible list catches up on events that fired while the socket was dead — Pusher is fire-and-forget with no server-side buffering. Refresh must be gated on the `'connected'` event rather than fired immediately after `connect()`: an immediate refresh triggers a Mithril redraw that races with pusher-js's channel resubscription on the new socket and leaves the client receiving no further push events. Fixes flarum#4588.
Includes transpiled JS/TS.
d5ff3f4 to
84ef143
Compare
Closes #4588.
What
Add
visibilitychange+pageshowhandlers inrealtime/js/src/forum/extend/Application.tsthat force a WebSocket reconnect after iOS Safari backgrounding, and — once the new connection is live — refresh the visible discussions list to catch up on events that fired while the socket was dead.Why
See the linked issue for the full diagnosis. Two compounding iOS Safari behaviors produce the same "realtime is broken after switching apps" symptom, and both need handling:
close, so pusher-js's heartbeat-based recovery never trips. The client believes it's still subscribed, but no events arrive after return.Plus one iOS-specific return path that a
visibilitychange-only fix misses: iOS bfcaches pages on app-switch. Return via bfcache restores the page withpageshow(persisted: true) and does NOT firevisibilitychange. So both events need to be hooked.How
Design choices
visibilitychange. Avoids churning the socket on brief app-switches; in practice iOS doesn't kill a socket that quickly.pageshowwithpersisted: true. Covers the bfcache-restore return path thatvisibilitychangealone misses on iOS.disconnect()thenconnect()with a 100 ms gap. pusher-js will no-opconnect()when it still believes the socket is alive (the exact symptom we're working around) — an explicit teardown is required. The 100 ms gap prevents a "connecting zombie" race where theconnect()fires while the previous teardown is still in flight. Channels are preserved in pusher-js's internal map and automatically re-subscribed on the new socket.discussions.refresh()on'connected', not immediately afterconnect(). This was the critical fix — an earlier iteration that calledrefresh()synchronously afterconnect()regressed scenario 1 (new pushes stopped arriving to the newly-reconnected socket). The immediate refresh triggers a Mithril redraw that race-conditions with pusher-js'ssubscribeAll()on the fresh socket, sometimes leaving the client bound but receiving no events. Gating on the connection's'connected'event ensures the refresh only fires after resubscription has completed, which is race-free.enabledTransportschange. Leaving['wss', 'ws']alone — HTTP polling fallback would mask the real reconnect and isn't needed once visibility-driven recovery is in place.Application.mount(). Comment in-code documents thatmount()runs once per page load, so there's no teardown path needed for the event listeners.Scope
extensions/realtime/js/src/forum/extend/Application.ts, +55 / -0).app.discussions.refresh()). The single-discussion view ("I'm reading a thread, switch apps, return — show me posts that appeared while I was away") is explicitly out of scope for v1; it needs a stream-level re-hydration that is a different code path and can land in a follow-up.Testing
yarn buildinextensions/realtime/js/succeeds.v2.0.0-beta.8staging forum:discussions.refresh()fires after the new socket reconnects and the missed content appears.pageshowpersisted=true, notvisibilitychange) triggers reconnect.Iteration history (for reviewer context, can collapse)
Why the code looks the way it does
disconnect()+connect()onvisibilitychange. Fixed scenario 1 but missed scenario 2 (events posted while away were gone) and missed bfcache returns.app.discussions.refresh()immediately afterconnect(). Fixed scenario 2.pageshowhandler for bfcache.refresh()triggered a Mithril redraw that race-conditioned with pusher-js'ssubscribeAll()on the new socket. Movedrefresh()into anonReconnectedcallback bound to the'connected'event on the connection object, so it only fires after re-subscription. All three behaviors now work.Marking ready for review.