
Optimizing Socket.IO for 100k Concurrent Users

Wren Noble

Lead Engineer

Socket.IO is the "Hello World" of the real-time web. It's accessible, it has a great API, and it works perfectly... until it doesn't.

For Winkr, "doesn't" happened at exactly 12,500 Concurrent Users (CCU).
One Saturday night, our CPU usage spiked to 100%, latency jumped to 4 seconds, and our servers started playing "packet pinball," dropping connections randomly. We were victims of our own success.

We spent the next 72 hours rewriting our core infrastructure. Here is the technical post-mortem of how we scaled Node.js and Socket.IO to handle 100k+ users without melting.

1. The Redis Adapter Bottleneck (The Silent Killer)

Socket.IO allows you to scale horizontally by adding more servers. To make them talk to each other, you use the Redis Adapter. It uses Redis Pub/Sub to broadcast events.
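For reference, the stock wiring looks something like this (assuming the @socket.io/redis-adapter package with a node-redis v4 client; adjust to your setup):

const { createServer } = require("http");
const { Server } = require("socket.io");
const { createClient } = require("redis");
const { createAdapter } = require("@socket.io/redis-adapter");

const httpServer = createServer();
const io = new Server(httpServer);

// One pub connection and one sub connection, shared by every room on this node
const pubClient = createClient({ url: "redis://localhost:6379" });
const subClient = pubClient.duplicate();

Promise.all([pubClient.connect(), subClient.connect()]).then(() => {
  io.adapter(createAdapter(pubClient, subClient));
  httpServer.listen(3000);
});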

The problem: By default, the adapter broadcasts too much.
If User A (on Server 1) sends a message to Room X, Redis publishes that message to Server 2, Server 3, ... Server 50.
Even if Server 50 has zero users in Room X, it still receives the packet, deserializes it, checks its list of sockets, and realizes "Oh, I don't need this."

The Fix: Sharded Redis & Custom Routing
We stopped using the default adapter. We implemented a custom Sharding Layer.
We hash the Room ID to a specific Redis Cluster.
ClusterID = hash(RoomID) % TotalClusters
Now, messages for "Room X" only go to the specific Redis nodes that care about "Room X." This reduced our internal Redis traffic by 94%.
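A minimal sketch of the routing idea (the cluster URLs, helper names, and hash choice below are illustrative, not our production code):

const { createHash } = require("crypto");
const { createClient } = require("redis");

// Illustrative: one client per independent Redis cluster (shard)
const CLUSTER_URLS = [
  "redis://redis-0:6379",
  "redis://redis-1:6379",
  "redis://redis-2:6379",
  "redis://redis-3:6379",
];
const clusters = CLUSTER_URLS.map((url) => createClient({ url }));

// ClusterID = hash(RoomID) % TotalClusters
function clusterFor(roomId) {
  const digest = createHash("md5").update(roomId).digest();
  return digest.readUInt32BE(0) % clusters.length;
}

async function publishToRoom(roomId, payload) {
  // Only the shard that owns this room ever sees the message
  const client = clusters[clusterFor(roomId)];
  if (!client.isOpen) await client.connect();
  await client.publish(`room:${roomId}`, JSON.stringify(payload));
}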

2. Sticky Sessions & The Handshake Dance

Socket.IO is not just WebSockets. It starts with HTTP Long-Polling and upgrades to WebSockets. This requires the client to hit the exact same server for multiple requests in a row.

If you use a standard round-robin load balancer (the AWS ALB default), Request 1 goes to Server A and Request 2 goes to Server B. Server B says: "Who are you? I don't know this Session ID." The connection drops.

The Fix: NGINX IP Hashing
We moved to NGINX with strict ip_hash balancing.
upstream socket_nodes {
    ip_hash;
    server node1:3000;
    server node2:3000;
}
This ensures User A always talks to Server A. It sounds basic, but missing sticky sessions is the #1 reason Socket.IO fails in production.

3. Binary Packing (MsgPack > JSON)

JSON is expensive. It is text-based. It is verbose.
Sending {"type": "offer", "sdp": "..."} wastes bytes on quotes, brackets, and field names.

We switched our internal serialization protocol to MsgPack.
MsgPack is binary. It compresses integers, booleans, and strings into the smallest possible byte sequence.
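Socket.IO lets you swap the wire parser, and the socket.io-msgpack-parser package is one off-the-shelf way to do it. A minimal sketch assuming that package (the URL is a placeholder; our internal protocol differs in the details):

const { Server } = require("socket.io");
const { io: connect } = require("socket.io-client");
const customParser = require("socket.io-msgpack-parser");

// Server: every packet on the wire is now binary MsgPack instead of JSON text
const io = new Server(3000, { parser: customParser });

// Client: both sides must agree on the parser, or the handshake fails
const socket = connect("https://example.com", { parser: customParser });
socket.emit("offer", { type: "offer", sdp: "..." });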
Result:
• Bandwidth usage: -45%.
• CPU time spent parsing JSON: -60%.
When you are paying for data transfer (AWS egress fees), these percentages add up fast: this change saved us $3,000/month.
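If you want to sanity-check the savings on your own payloads before switching, a quick measurement with any MsgPack encoder is enough (the @msgpack/msgpack package is assumed here):

const { encode } = require("@msgpack/msgpack");

// Any representative payload from your app; this one mirrors the offer payload above
const payload = { type: "offer", sdp: "v=0 o=- 46117314 2 IN IP4 127.0.0.1" };

const jsonBytes = Buffer.byteLength(JSON.stringify(payload));
const msgpackBytes = encode(payload).byteLength;

console.log(`JSON: ${jsonBytes} bytes, MsgPack: ${msgpackBytes} bytes`);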

4. Tuning for Mobile Networks (The Heartbeat)

Users on 4G/5G in moving trains drop packets constantly. The default Socket.IO ping settings (Timeout: 5s) are too aggressive. They disconnect users who are just going through a tunnel.

The Fix: Relaxed Heartbeats
We relaxed both heartbeat settings: pingInterval to 25000 (25s) and pingTimeout to 20000 (20s).
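On the server, these are just two options on the Server constructor (a minimal sketch):

const { createServer } = require("http");
const { Server } = require("socket.io");

const httpServer = createServer();
const io = new Server(httpServer, {
  pingInterval: 25000, // send a heartbeat every 25 seconds
  pingTimeout: 20000,  // allow 20 seconds of silence before declaring the socket dead
});

httpServer.listen(3000);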
This makes the server more tolerant of temporary packet loss. It reduced "ghost disconnects" by 30%.

5. Linux Kernel Tuning (ulimit is a lie)

Out of the box, Linux allows a process to open 1,024 files. In Linux, a socket is a file.
So at User #1,025, your server crash loops with EMFILE: too many open files.

You run ulimit -n 65535. You think you are safe. You are not.
You also need to tune the Ephemeral Port Range and TCP Buffer Sizes.
We edited /etc/sysctl.conf:
net.ipv4.ip_local_port_range = 1024 65535   # widen the ephemeral port range
net.core.somaxconn = 65535                  # increase the accept backlog
Without these OS-level tweaks, your Node.js code doesn't matter. The kernel will kill you first.

Conclusion: Assume Infinite Scale

Optimization isn't something you do once. It is a continuous war against entropy.
Today, Winkr handles 100k+ concurrent users with sub-100ms latency. But we know that at 1 million users, everything will break again. And we can't wait to fix it.