Daniel Bitengo

Originally published at danbitengo.hashnode.dev

The WebSocket Message That Disappeared Between the Browser and the Server

It started with a simple observation: some WebSocket messages arrived on the server, and others didn't.

I was building SyncKit, a real-time sync SDK that uses CRDTs for conflict-free collaboration. The demo app, LocalWrite, is a collaborative document editor. Think Google Docs, but local-first. Two users open the same document, type at the same time, and their changes merge automatically.

Locally, everything worked. On Fly.io, it didn't.

The symptom

The server logs told a clear story:

[WS] Received message, typeCode: 0x30 (ping)               ✅
[WS] Received message, typeCode: 0x40 (awareness_update)   ✅
[WS] Received message, typeCode: 0x50 (delta_batch)        ❌ never appears

Pings arrived. Cursor awareness updates arrived. But the actual document changes? Gone. Swallowed somewhere between the browser and the server.

The client logs confirmed the messages were being sent:

[WS] Calling ws.send() for delta_batch, message size: 1054 bytes (binary)
[WS] ✓ ws.send() completed for delta_batch

So the client was sending. The server wasn't receiving. Only in production. Same code, same binary protocol, different infrastructure.

The first wrong theory

Before this became a two-week saga, there was a simpler bug hiding underneath. The SDK was calling this.emitter.emit('delta', ...) thinking that would send a WebSocket message. It doesn't. EventEmitter fires local events within the process. To actually send data over the wire, you need this.websocket.send().
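
A minimal sketch of the mix-up, using Node's EventEmitter for illustration and made-up class and method names rather than SyncKit's actual internals:

import { EventEmitter } from 'node:events'

class SyncClient {
  private emitter = new EventEmitter()
  private websocket!: WebSocket

  // The buggy version: emits a local event inside this process only.
  // No WebSocket frame is ever produced, so nothing leaves the machine.
  sendDeltaBroken(encoded: Uint8Array) {
    this.emitter.emit('delta', encoded)
  }

  // The fix: actually write the bytes to the socket.
  sendDeltaFixed(encoded: Uint8Array) {
    this.websocket.send(encoded)
  }
}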

Classic mix-up. Fixed it, tested locally, everything synced beautifully.

Then I deployed to Fly.io, and the delta_batch messages vanished.

At this point my theory was straightforward: Fly.io must be dropping larger messages. Pings are tiny. Awareness updates are small. Delta batches are bigger. Maybe Fly.io's WebSocket proxy had a size limit on the free tier.

I was on Fly.io's Legacy Hobby plan, a deprecated free tier. It made sense that a free plan would have limitations. So the fix was obvious: upgrade to the paid plan.

One problem. My debit card had expired.

The 10-day wait

This is the part of the debugging story that nobody writes about. Sometimes the blocker isn’t technical. It’s you.

To upgrade the Fly.io plan, I needed a working debit card. To get a new debit card, I needed to walk to my bank branch. The branch is about 2km from my apartment in Nairobi, and processing takes two to three hours of sitting in a bank.

I kept telling myself I'd go tomorrow. Tomorrow turned into ten days.

During those ten days I worked on other parts of the release. Built features, wrote tests, polished the UI. Productive stuff. But the whole time, this unsolved bug sat in the back of my mind. I was almost certain the plan upgrade would fix it. Almost. And that "almost" was a convenient excuse to keep postponing the bank trip.

I finally went on January 22nd. Got the card. Upgraded to Pay As You Go that same night.

Deployed. Tested. Watched the logs.

Same behavior. Pings arrive. Awareness arrives. Delta batch: nothing.

The plan upgrade didn't fix anything.

"Ask your LLM"

With the paid plan active, I could now contact Fly.io support. I wrote a detailed bug report: here's what works, here's what doesn't, here's the binary protocol we're using, here's proof it works locally but not on your infrastructure.

The response came from Daniel:

This is incorrect. We treat the WS connection as a raw TCP stream. We don't manipulate or drop payloads in any way.
I can suggest two things to try:

  1. Can you ask your LLM to diagnose with the assumption that Fly.io proxy and infrastructure do not filter/drop websocket messages?
  2. Would it be possible to produce a small reproduction case so we can also test it on our side?

"Ask your LLM." That was the suggestion. Not "let me check our proxy logs." Not "here's a diagnostic endpoint you can use." Just... ask your LLM.

To be fair, I should mention: Daniel turned out to be right. Fly.io wasn't dropping messages. But in the moment, getting that response after ten days of procrastination and hours spent writing a detailed bug report felt like being told to Google it.

I was on my own.

The long night

Thursday night into Friday. I had no more theories, just stubbornness.

Attempt 1: Chunking. Maybe large messages get dropped. Split delta_batch into smaller pieces, reassemble on the server. Didn't work. The chunks didn't arrive either. And honestly, I'd already tried this in an earlier session and forgotten. I was going in circles.
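
For what it's worth, the chunking attempt looked roughly like this (hypothetical frame size and helper name; any real version would also need reassembly metadata on each chunk):

const CHUNK_SIZE = 512

// Split a large binary payload into fixed-size WebSocket frames.
// The server would need matching reassembly logic keyed on a message id.
function sendChunked(ws: WebSocket, payload: Uint8Array) {
  for (let offset = 0; offset < payload.length; offset += CHUNK_SIZE) {
    ws.send(payload.subarray(offset, offset + CHUNK_SIZE))
  }
}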

Attempt 2: Different type codes. Maybe 0x22 (the original delta_batch type code) was somehow problematic. Changed it to 0x50, then other values. Same behavior regardless. Whatever was happening, it wasn't about the content of the message.

Attempt 3: JSON instead of binary. Maybe the binary protocol itself was the issue. Nope. JSON messages didn't arrive either.

Three attempts, three failures. Each one took roughly an hour to implement, deploy, test, and confirm failure. It was past midnight and I had nothing.

Attempt 4: Disable awareness entirely.

This was desperation. If I can't figure out what's wrong with delta_batch, let me start removing other things and see what changes.

I commented out all awareness broadcasts. No cursor positions, no user presence. Just delta_batch on its own.

Delta batch arrived on the server.

I enabled awareness again. Delta batch stopped arriving.

Disabled. Works. Enabled. Broken.

The problem was never about delta_batch. It was about the interaction between delta_batch and awareness messages.

Attempt 5: Two separate WebSocket connections. If they're interfering with each other, give them separate connections. One for awareness, one for deltas.

Made it worse. Neither connection worked properly. Reverted.

Railway proved it wasn't Fly.io

Friday afternoon. I had one remaining question: is this specific to Fly.io?

I deployed the server to Railway. Different platform, different infrastructure, different proxy. If Railway worked, the problem was Fly.io's proxy. If Railway also failed, it was something in our code.

Railway also failed. Same exact behavior.

So Daniel was right. It wasn't Fly.io. It was us. Something in the way we were sending messages broke in production but not locally. And it had to do with awareness messages and delta_batch being sent through the same connection.

What's different between local and production? Network latency. Load balancers. Proxies. TCP buffering.

The fix

Friday evening. Running on fumes.

I went back to a WebSocket API I'd never paid much attention to: bufferedAmount. It tells you how many bytes are sitting in the WebSocket's send buffer, waiting to be transmitted over the network.

When you call ws.send(), the data doesn't immediately leave your machine. It goes into a buffer. The buffer gets flushed when the network is ready to accept it. Locally, this happens almost instantly because there's no real network latency. The buffer fills and empties faster than you can blink.
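
You can see this directly with the standard browser WebSocket API (placeholder URL and sizes, purely for illustration):

const ws = new WebSocket('wss://example.com/sync')  // placeholder endpoint

ws.addEventListener('open', () => {
  ws.send(new Uint8Array(1054))       // queue ~1 KB of binary data
  console.log(ws.bufferedAmount)      // often > 0: bytes still queued locally
  setTimeout(() => {
    console.log(ws.bufferedAmount)    // usually 0 once the data has gone out
  }, 50)
})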

In production, behind a load balancer and proxy, it takes longer. The buffer holds data for a few milliseconds. Sometimes longer.

Here's what was happening: awareness updates fire 10+ times per second as the cursor moves. Each one fills the send buffer. When a delta_batch message comes in right after an awareness update, the buffer isn't empty yet. The delta_batch goes into the buffer behind the pending awareness data. And somewhere in the network path (the proxy, the load balancer, whatever), that mixed buffer gets mangled. The awareness message arrives fine. The delta_batch doesn't.

Locally, the buffer flushes so fast that messages never overlap. In production, the timing is just slow enough for them to collide.

The fix was seven lines:

// Hold back a delta_batch if earlier frames (usually awareness updates)
// are still queued in the socket's send buffer; poll until it drains.
if (message.type === 'delta_batch' && this.ws.bufferedAmount > 0) {
  const waitForDrain = () => {
    if (this.ws.bufferedAmount === 0) {
      this.ws.send(encoded)          // buffer is empty: safe to send now
    } else {
      setTimeout(waitForDrain, 10)   // still draining: check again in 10ms
    }
  }
  setTimeout(waitForDrain, 10)
  return                             // skip the immediate send below
}

Before sending a delta_batch, check if the buffer is empty. If it's not, wait. Poll every 10ms until bufferedAmount hits zero. Then send.

I also throttled awareness updates to a maximum of 10 per second to reduce the buffer pressure.
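
The throttle itself is nothing fancy. A minimal version along those lines (function name and wiring are illustrative, not the SDK's actual code):

// Allow at most one awareness update per interval (100ms = 10 per second).
// Updates arriving inside the window are dropped; the next cursor move
// after the window closes gets through.
function createAwarenessThrottle(send: (update: Uint8Array) => void, intervalMs = 100) {
  let lastSent = 0
  return (update: Uint8Array) => {
    const now = Date.now()
    if (now - lastSent < intervalMs) return
    lastSent = now
    send(update)
  }
}

// Usage: const sendAwareness = createAwarenessThrottle(u => ws.send(u))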

Rebuilt. Deployed to Railway.

[WS] delta_batch waiting for buffer to drain (bufferedAmount: 179)
[WS] Buffer drained, now sending delta_batch

Server side:

typeCode=0x50, size=274  ← DELTA BATCH ARRIVED

Deployed to Fly.io.

typeCode=0x40, size=179   ← awareness_update ✅
typeCode=0x50, size=274   ← delta_batch ✅
typeCode=0x50, size=8398  ← delta_batch ✅

Both message types arriving. On both platforms. After fifteen days.

What actually went wrong

The root cause, in one sentence: when awareness messages were sitting in the WebSocket send buffer, delta_batch messages sent immediately after would get lost in the network path.

Locally, the buffer flushes instantly. Zero latency means zero overlap. In production, the buffer takes time to drain through proxies and load balancers. Messages collide. And when they collide, the critical one (delta_batch) loses.

The fix is simple. Wait for the buffer to drain. That's it.

What I'd do differently

Check bufferedAmount from the start. I'd never used this API before. Didn't know it existed. If I'd thought about buffer contention on day one, I could have saved two weeks. The WebSocket interface exposes barely a dozen properties and methods, and most of us use about four of them. bufferedAmount should be on everyone's short list.

Test on a real network earlier. "Works locally" meant nothing here. The bug only existed when there was enough network latency for buffers to overlap. I should have tested through a real network from the beginning, even if that meant deploying to a staging environment before I thought the code was ready.

Don't blame the infrastructure first. I spent days convinced Fly.io was dropping messages. Built chunking. Tried different type codes. Wrote a support ticket. All because I assumed the problem was external. It wasn't. The "ask your LLM" response stung, but it also forced me to stop pointing fingers and start looking at my own code.

Track what you've already tried. I attempted chunking twice. JSON encoding twice. When you're debugging the same problem across multiple sessions spanning weeks, you lose track. I started keeping a list of failed approaches. Should have done that from the beginning.

The unglamorous truth

Twenty-plus hours of debugging. Six failed approaches. A 10-day delay because I didn't feel like walking to a bank. A support ticket that told me to ask my LLM. A deployment to an entirely separate platform just to rule out infrastructure.

The fix was checking one property before calling send.

Most debugging stories are like this. The resolution is never proportional to the effort. You don't spend twenty hours on a twenty-hour problem. You spend twenty hours discovering that the problem was small and specific, hiding behind a wall of wrong assumptions.

Somewhere out there, someone is staring at a WebSocket connection that works locally and fails in production. Their messages are disappearing and they can't figure out why. Maybe they're blaming their hosting provider. Maybe they're rewriting their binary protocol. Maybe they're about to deploy to a second platform just to test a theory.

Check bufferedAmount. It might save you two weeks.


SyncKit is an open-source local-first sync SDK using Fugue CRDTs and WebAssembly for conflict-free real-time collaboration. You can try the live demo or check the source on GitHub.
