Your SSE Connections Are Lying to You — Here's What Happens When Users Reconnect

Your user is in a live mentoring session. Their phone loses signal for three seconds. Maybe they walked past a microwave. Maybe their carrier hiccupped.
The SSE connection dies. The browser reconnects.
But during those three seconds, two events fired on the server. A billing warning: "5 minutes of credit remaining." A billing update with the latest charge. The reconnection succeeds. The client is live again.
It never saw those two events.
The UI shows a stale balance. The 5-minute warning never appeared. When the session auto-ends due to balance depletion, the user is surprised and furious.
This is the reconnection gap. And most SSE implementations don't handle it.
I ship real-time billing on two production platforms — mentoring.oakoliver.com for live session billing, and vibe.oakoliver.com for captive credit approvals. Both depend on SSE. Both would be unreliable without event replay.
This is how we guarantee that reconnecting clients catch up on every missed event, no matter how long they were offline.
I – The Protocol Nobody Reads
The SSE specification actually has built-in reconnection support.
Most developers know that the browser's EventSource reconnects automatically when a connection drops. Fewer know about the Last-Event-ID mechanism hiding in plain sight.
Here's the dance. The server sends each event with a unique ID field. The browser silently remembers the last ID it received. When the connection drops and EventSource reconnects, it sends a Last-Event-ID header with that remembered value.
The server can then use this header to replay every event the client missed.
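To make that concrete, here's a minimal sketch of the server side of the handshake, assuming a Node/Express stack. The route, event names, and ID format are illustrative, not our production code.

```typescript
import express from "express";

const app = express();

app.get("/sse/session/:sessionId", (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");
  res.flushHeaders();

  // Sent automatically by EventSource when it reconnects on its own.
  const lastEventId = req.header("Last-Event-ID");

  // Each event carries an `id:` field; the browser remembers the latest one it saw.
  const sendEvent = (id: string, type: string, data: unknown) =>
    res.write(`id: ${id}\nevent: ${type}\ndata: ${JSON.stringify(data)}\n\n`);

  // The replay itself is your job: the spec only defines the header.
  if (lastEventId) {
    // look up everything after lastEventId in the session's history and resend it
  }

  sendEvent("42-1718031245123", "billing_update", { minutesElapsed: 23 });
});

app.listen(3000);
```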
That last step is where most implementations fall apart. The spec tells you the mechanism exists. It does not tell you how to implement the replay. That's entirely on you.
And "on you" is where 90% of teams stop reading the spec.
II – The Three-Second Blindness Problem
Think about what happens without replay.
A billing session sends updates every 30 seconds. The user's connection drops for 3 seconds. One event fires during the gap. The client reconnects and picks up live events going forward — but it never receives the missed one.
Now scale that scenario. A mobile user on a train. Connection drops every few minutes. Each time, one or two events slip through the cracks. After an hour, their UI is showing data that's silently diverged from reality.
The worst part? The user has no idea. The connection indicator says "connected." Events are flowing. Everything looks fine. The data is just wrong.
This is not a theoretical concern. This is a support ticket that reads: "Your billing is broken, I was charged for time I didn't use."
No. The billing was correct. The display was stale because the reconnection swallowed two events 40 minutes ago.
III – The Event History Buffer
The fix is deceptively simple. Every active session maintains a buffer of recent events in memory.
Think of it as a short-term memory. When an event is created and broadcast to connected clients, it's also pushed into the session's event history. The buffer is capped — we keep the last 100 events to prevent unbounded memory growth.
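A sketch of that buffer, with the event shape simplified for illustration:

```typescript
// A minimal capped history buffer. The SessionEvent shape is an assumption,
// not the real schema.
interface SessionEvent {
  id: string;
  type: string;
  data: unknown;
}

const MAX_HISTORY = 100;

class SessionEventHistory {
  private events: SessionEvent[] = [];

  // Called alongside the broadcast: every event goes to live clients AND here.
  push(event: SessionEvent): void {
    this.events.push(event);
    // Cap the buffer so a long-running session can't grow memory without bound.
    if (this.events.length > MAX_HISTORY) {
      this.events.shift();
    }
  }

  all(): readonly SessionEvent[] {
    return this.events;
  }
}
```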
Why 100? A mentoring session running for two hours at 30-second billing intervals generates roughly 240 billing updates, plus a handful of warnings and pause/resume events. A buffer of 100 covers approximately 50 minutes of history.
That's more than enough for any realistic reconnection window.
If a client disconnects for longer than 50 minutes, they have bigger problems than missed events. At that point, a full state resync is more appropriate — and we handle that separately.
The memory math is negligible. Each event is roughly 200-300 bytes serialized. At 100 events per session, that's about 30KB. With 50 concurrent sessions — an ambitious scale scenario — you're looking at 1.5MB total. Less than a single high-resolution image.
IV – Event ID Design That Doesn't Betray You
The event ID format matters more than you'd think. It needs three properties: uniqueness across the session lifetime, orderability so you can find "everything after this ID," and debuggability so you can read it in logs without a lookup table.
Our format combines a monotonically increasing counter with a timestamp. Something like the 42nd event created at a specific millisecond.
The counter provides ordering. Even if two events fire in the same millisecond — unlikely but possible during a burst of pause-plus-warning events — the counter disambiguates them.
The timestamp provides debuggability. When you're reading production logs at 2 AM, you can immediately tell when an event was created without cross-referencing a mapping table.
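Here's roughly what such a generator looks like. The exact format is illustrative; the point is counter for ordering, timestamp for humans.

```typescript
// A per-session counter for ordering, plus a timestamp for log readability.
class EventIdGenerator {
  private counter = 0;

  next(): string {
    this.counter += 1;
    // e.g. "42-1718031245123": the 42nd event, created at that millisecond.
    return `${this.counter}-${Date.now()}`;
  }
}

// Parsing the counter back out is what lets the server order IDs and find
// "everything after this one".
const counterOf = (id: string): number => parseInt(id.split("-")[0], 10);
```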
Here's a subtlety that bit us early. The counter resets on server restart. A client reconnecting with an ID from before the restart would find no matching event in the new buffer. The replay function handles this gracefully by falling back to the most recent events.
Defense in depth. Never assume your identifiers survive infrastructure events.
V – The Replay Function: Three Cases, Three Strategies
When a client reconnects and sends a Last-Event-ID header, the server needs to decide what to replay. There are exactly three scenarios.
First: a brand new connection. No Last-Event-ID header. The client has never connected before, or they just opened the page. Return the last 10 events to provide immediate context. This typically includes the most recent billing update and any recent warnings — enough for the UI to render the current state without any additional API calls.
Second: a reconnection with a known event ID. The client was connected, lost the connection, and reconnected. Find that ID in the buffer and return every event after it. If the disconnection was 3 seconds, that might be zero or one event. If it was 5 minutes, that might be 10. Either way, the client processes them in order and catches up.
Third: an unrecognized event ID. The server restarted (counter reset), or the client was disconnected so long that the event rolled out of the 100-event buffer. Fall back to the last 10 events — same as a new connection. The client gets current state even if it missed some intermediate events.
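In code, the whole decision fits in one small function. This sketch reuses the SessionEvent shape and the 10-event default from earlier; the names are illustrative.

```typescript
const DEFAULT_REPLAY = 10;

function eventsToReplay(
  history: readonly SessionEvent[],
  lastEventId: string | undefined
): readonly SessionEvent[] {
  // Case 1: brand new connection, no Last-Event-ID. Recent context only.
  if (!lastEventId) {
    return history.slice(-DEFAULT_REPLAY);
  }

  const index = history.findIndex((e) => e.id === lastEventId);

  // Case 3: unknown ID (server restart, buffer rollover). Same safe fallback.
  if (index === -1) {
    return history.slice(-DEFAULT_REPLAY);
  }

  // Case 2: known ID. Replay everything after it, in order.
  return history.slice(index + 1);
}
```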
Every case has a safe fallback. No case throws an error. No case leaves the client in a broken state.
This is the kind of resilience that separates "works in demo" from "works in production."
VI – The Connection Initialization Sequence
When a client connects (or reconnects), the server runs a three-step sequence that takes the client from "disconnected" to "fully synchronized" in a single connection establishment.
Step one: connection confirmation. The server sends a handshake event with a client ID. This tells the client it's connected and provides an identifier for heartbeat acknowledgment.
Step two: event replay. If the client sent a Last-Event-ID, the server replays every missed event in order, each with its original ID. The client processes them sequentially and catches up.
Step three: a current state snapshot. Even if no events were missed, the client receives a fresh billing update so it renders the latest numbers immediately.
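Here's the sequence as a sketch, wired up from the earlier pieces. The snapshot helper and event names are assumptions.

```typescript
function initializeConnection(opts: {
  send: (id: string, type: string, data: unknown) => void; // writes one SSE frame
  history: readonly SessionEvent[];
  lastEventId: string | undefined;
  clientId: string;
  nextId: () => string;    // e.g. EventIdGenerator.next from earlier
  snapshot: () => unknown; // current billing state (assumed helper)
}): void {
  // Step 1: handshake. Tell the client it's connected and give it an ID
  // to use for heartbeat acknowledgment.
  opts.send(opts.nextId(), "connected", { clientId: opts.clientId });

  // Step 2: replay every missed event, in order, with its original ID.
  for (const ev of eventsToReplay(opts.history, opts.lastEventId)) {
    opts.send(ev.id, ev.type, ev.data);
  }

  // Step 3: a fresh snapshot so the UI renders the latest numbers immediately.
  opts.send(opts.nextId(), "billing_update", opts.snapshot());
}
```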
The beauty of this sequence is what it eliminates. No secondary API calls to fetch current state. No "loading" spinner after reconnection. No polling endpoint to check "did I miss anything?" The SSE stream itself carries all the data needed.
A reconnecting client goes from zero to fully synchronized within the server's first response, with no extra requests.
VII – Client-Side Reconnection: Exponential Backoff
On the React side, reconnection uses exponential backoff. One second, two seconds, four, eight, sixteen — capped at 30 seconds. After a configurable maximum number of attempts, the hook gives up and surfaces an error to the UI.
Why not rely on EventSource's built-in reconnection? Two reasons.
First, the built-in reconnection fires on transient network errors with a default retry interval, but it doesn't handle all failure modes. Some conditions — like HTTP 4xx responses — cause EventSource to give up entirely.
Second, you can't control the backoff strategy. The spec says servers can suggest a retry interval, but client implementations vary across browsers. Manual backoff gives consistent behavior everywhere.
We layer both. EventSource handles the easy cases automatically. Our manual reconnection with backoff catches the edge cases. Belt and suspenders.
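A simplified sketch of that layering as a React hook, with illustrative names and limits. One caveat worth a comment: a manually re-created EventSource starts with a blank memory, so on manual reconnects the application has to carry the last event ID over itself.

```typescript
import { useEffect, useRef, useState } from "react";

// Reconnection layer only; message handling is omitted from this sketch.
export function useSseWithBackoff(url: string, maxAttempts = 10) {
  const [error, setError] = useState<Error | null>(null);
  const attemptRef = useRef(0);

  useEffect(() => {
    let source: EventSource | null = null;
    let timer: ReturnType<typeof setTimeout> | undefined;
    let cancelled = false;

    const connect = () => {
      // Note: a manually re-created EventSource does not resend Last-Event-ID;
      // only its own automatic reconnections do. To cover manual reconnects,
      // track the last id yourself and pass it, e.g., as a query parameter.
      source = new EventSource(url);

      source.onopen = () => {
        attemptRef.current = 0; // a healthy connection resets the backoff
      };

      source.onerror = () => {
        // EventSource retries transient errors on its own; only step in once
        // it has given up (readyState CLOSED), e.g. after an HTTP 4xx.
        if (cancelled || source?.readyState !== EventSource.CLOSED) return;

        if (attemptRef.current >= maxAttempts) {
          setError(new Error("SSE reconnection failed"));
          return;
        }

        // 1s, 2s, 4s, 8s, 16s... capped at 30s.
        const delay = Math.min(1000 * 2 ** attemptRef.current, 30_000);
        attemptRef.current += 1;
        timer = setTimeout(connect, delay);
      };
    };

    connect();
    return () => {
      cancelled = true;
      if (timer) clearTimeout(timer);
      source?.close();
    };
  }, [url, maxAttempts]);

  return { error };
}
```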
VIII – The Idempotency Problem Nobody Talks About
Here's a subtlety that catches most implementations: replayed events must be idempotent.
Consider this. A client receives a billing update showing 23 minutes elapsed. It disconnects. The server sends an update showing 24 minutes — missed. The client reconnects. In the clean case it receives only the replayed 24-minute event, but in the fallback cases (unknown ID, server restart) it also receives the 23-minute update it already processed. Then the next live event arrives: 24.5 minutes.
If the client naively appends events to a list, it shows duplicate entries. If the client treats each billing update as "the latest state" — overwriting the previous one — it works correctly.
Our billing updates are "latest wins." Each one contains the complete current state. There's no delta, no increment that would break on replay. The client just renders whatever the most recent event says.
This makes replay inherently idempotent. No deduplication logic needed. No sequence number tracking on the client.
For one-time events like billing warnings — "5 minutes remaining" — idempotency is different. The warning callback should fire only once per threshold. The client tracks which warning types have already been shown in a set. If the same warning arrives again via replay, the set catches the duplicate and silently ignores it.
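On the client, that design boils down to two tiny handlers. A sketch, with assumed event shapes:

```typescript
// Billing updates are "latest wins"; warnings are deduplicated by type.
type BillingUpdate = { minutesElapsed: number; creditRemaining: number };

let latestBilling: BillingUpdate | null = null;
const shownWarnings = new Set<string>();

function onBillingUpdate(update: BillingUpdate): void {
  // The event carries the complete current state, so a replayed or duplicated
  // update simply overwrites. Nothing to deduplicate.
  latestBilling = update;
}

function onBillingWarning(warningType: string): void {
  // Each warning threshold fires at most once, even if it arrives again
  // via replay after a reconnection.
  if (shownWarnings.has(warningType)) return;
  shownWarnings.add(warningType);
  console.warn(`Billing warning: ${warningType}`); // e.g. "5_minutes_remaining"
}
```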
Design your events for replay from day one. If every event is either "latest wins" or deduplicated by type, replay becomes trivial.
IX – Adapting the Pattern Across Platforms
We use this exact replay pattern in two different contexts.
On the mentoring platform, events are session-scoped. Each billing session maintains its own event history. Sessions are ephemeral — minutes to hours. The buffer of 100 events with a replay window of 10 is tuned for 30-second billing ticks.
On Vibe, events are user-scoped. Each user has a billing event channel for credit holds and payment approvals. Events are less frequent — a hold-and-approve pair is typically just 2 events per paid action. The buffer is smaller (50 events) and the replay window is 5.
The critical replay scenario on Vibe: a user triggers a paid action, the approval-required event fires via SSE, but the client reconnects before receiving it. Without replay, the approval modal never appears and the API request hangs until timeout. With replay, the modal appears immediately after reconnection.
Same pattern, different tuning parameters. The underlying logic is identical. After implementing it twice, we extracted it into a reusable primitive — a generic replay buffer that takes a max size and default replay count as constructor arguments.
The abstraction is small enough to inline but clean enough to share across services.
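A sketch of that primitive, with illustrative names; it reuses the SessionEvent shape from earlier.

```typescript
// A generic replay buffer parameterized by max size and default replay count.
class ReplayBuffer<E extends { id: string }> {
  private events: E[] = [];

  constructor(
    private readonly maxSize: number,
    private readonly defaultReplay: number
  ) {}

  push(event: E): void {
    this.events.push(event);
    if (this.events.length > this.maxSize) this.events.shift();
  }

  replayAfter(lastEventId: string | undefined): readonly E[] {
    if (!lastEventId) return this.events.slice(-this.defaultReplay);
    const index = this.events.findIndex((e) => e.id === lastEventId);
    if (index === -1) return this.events.slice(-this.defaultReplay);
    return this.events.slice(index + 1);
  }
}

// Session-scoped tuning on the mentoring platform, user-scoped on Vibe.
const sessionBuffer = new ReplayBuffer<SessionEvent>(100, 10);
const vibeBuffer = new ReplayBuffer<SessionEvent>(50, 5);
```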
X – Why Not External Event Stores?
We considered several alternatives before settling on in-memory buffers.
Redis Streams would survive server restarts and support horizontal scaling. But we run a single VPS. Redis adds operational complexity — another service to monitor, backup, and restart. The in-memory buffer handles 99.9% of reconnection scenarios. The 0.1% edge case (server restart during an active session) is handled by the database fallback.
Client-side event caching in localStorage or IndexedDB would let the client reconcile on reconnect. But that shifts complexity to the client. Every platform — web, mobile, CLI — would need its own reconciliation logic. Server-side replay keeps clients thin.
Full event sourcing with complete history would let you replay from any point in time. But billing sessions are ephemeral. Full event sourcing is designed for long-lived aggregates. A capped buffer matches the temporal nature of our use case.
Always match the solution to the problem's time horizon. Our events matter for minutes to hours, not days to years. The tooling should reflect that.
Building real-time systems and want to go deeper? I run 100+ mentoring sessions on architecture patterns like this at mentoring.oakoliver.com, and the Vibe platform at vibe.oakoliver.com is where these patterns ship to production. Explore both if you want to see event replay in the wild.
XI – Lessons That Survive Any Stack
After running event replay in production across two platforms, here's what I'd tell anyone building SSE-based real-time systems.
Always include event IDs in your SSE messages. Even if you don't implement replay today, the IDs make debugging easier and leave the door open for replay later. Adding IDs retroactively requires client updates. Adding them from the start costs nothing.
"Latest wins" semantics simplify everything. If each event contains the complete current state rather than a delta, replay is automatically idempotent. You don't need deduplication logic or sequence number tracking on the client. Design for this from the beginning.
Cap your buffers explicitly. Unbounded event history is a memory leak waiting for a long-running session to find it. Pick a cap based on your event frequency and maximum expected disconnection window. Then add a safety margin.
Handle the "unknown ID" case gracefully. Server restarts, buffer rollovers, and clock skew all produce IDs the server doesn't recognize. Falling back to the last N events is always a safe default. Never throw an error on an unrecognized Last-Event-ID.
Test reconnection explicitly. It's tempting to assume EventSource handles it. It handles the transport layer. Your application-level replay logic needs its own integration tests — tests that simulate disconnection, wait for events to fire, reconnect with a Last-Event-ID, and verify that exactly the right events are replayed.
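Here's the flavor of such a test, written against the ReplayBuffer sketch from earlier. The vitest assertions are an assumption; a full integration test would drive the actual SSE endpoint.

```typescript
import { describe, it, expect } from "vitest";

describe("event replay on reconnection", () => {
  it("replays exactly the events after the Last-Event-ID", () => {
    const buffer = new ReplayBuffer<{ id: string }>(100, 10);
    ["1-a", "2-b", "3-c", "4-d"].forEach((id) => buffer.push({ id }));

    // The client disconnected after seeing event 2-b, then reconnected.
    const replayed = buffer.replayAfter("2-b");

    expect(replayed.map((e) => e.id)).toEqual(["3-c", "4-d"]);
  });

  it("falls back to the last N events for an unrecognized ID", () => {
    const buffer = new ReplayBuffer<{ id: string }>(100, 2);
    ["1-a", "2-b", "3-c"].forEach((id) => buffer.push({ id }));

    // e.g. the server restarted and the counter reset.
    const replayed = buffer.replayAfter("99-stale");

    expect(replayed.map((e) => e.id)).toEqual(["2-b", "3-c"]);
  });
});
```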
XII – Catch-Up Consistency
With event replay in place, our SSE connections provide a specific guarantee.
For disconnections shorter than the buffer window — roughly 50 minutes at 30-second intervals — the client will receive every event it missed, in order, upon reconnection.
For longer disconnections, the client receives a recent state snapshot sufficient to render the current UI correctly.
This isn't eventual consistency. It's catch-up consistency. The client might lag behind temporarily during disconnection, but it never settles into an incorrect state. The moment it reconnects, it converges to the server's truth through replay.
That's the difference between "good enough" real-time and production-grade real-time. And it takes fewer than 50 lines of server-side logic to achieve.
Most SSE tutorials stop at "EventSource reconnects automatically." They leave out the part where your users stare at stale data and wonder why your app is broken.
Don't be most tutorials.
What's the most painful real-time reliability bug you've shipped to production — and how long did it take you to find it?
– Antonio