
Thread-Safe Hyperliquid Ingestion With Real-Time Freshness

Development C++ Hyperliquid Real-Time Threading
Assemblu

Shipping market data out of Hyperliquid looks simple until the pipes clog. My ingestion service was running inside a single container with three moving parts: a WebSocket listener, a PostgreSQL batch writer, and a NATS publisher. The new monitoring thread I introduced last week caught stalled pipelines in minutes, but it also uncovered a couple of subtle threading bugs and a nasty freshness regression. This post walks through what went wrong, how I fixed it in C++, and why the service now holds a rock-solid one-second stale-data delta.

Losing the Race Against Myself

Two independent threads were trying to keep the system alive:

  • The monitoring thread aggressively reconnected to dependencies when it spotted unhealthy flow.
  • The database worker thread drained queued order book data, batch-writing it to Postgres and publishing it to NATS.

I reused the single db_conn_ and nats_conn_ pointers across threads without coordination. As soon as the monitor decided to reconnect, it destroyed whichever connection the worker was actively using. That left me with dangling pointers and sporadic crashes under load.

The fix was to treat every access to a shared connection as a critical section. I wrapped both pointers with dedicated std::mutex guards and funneled every reconnect through them. The core pattern now looks like:

// Reconnect path: only one thread may touch the shared connection at a time.
std::lock_guard<std::mutex> lock(db_mutex_);
if (db_conn_) {
    // Close the old handle before swapping in the replacement.
    PQfinish(static_cast<PGconn*>(db_conn_));
}
db_conn_ = conn;  // conn is the freshly established connection

Any thread that needs to dereference the connection must first acquire the same mutex. On the publishing side, I added a scoped std::unique_lock so I can momentarily release the lock while attempting a reconnect, then re-lock to swap in the new handle safely. The same structure surrounds nats_conn_ inside connectToNats() and in the retry loop inside publishOrderBookEntry().
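
To make that shape concrete, here is a minimal sketch of the publish side. publishOrderBookEntry and nats_conn_ are the names from the code above; the mutex, the URL member, and the choice of the NATS C client (nats.c) are stand-ins I'm using just to keep the example self-contained.

#include <mutex>
#include <string>

#include <nats/nats.h>

struct Ingestor {
    std::mutex      nats_mutex_;
    natsConnection* nats_conn_ = nullptr;                  // shared with the monitor thread
    std::string     nats_url_  = "nats://localhost:4222";  // illustrative default

    bool publishOrderBookEntry(const std::string& subject, const std::string& payload) {
        std::unique_lock<std::mutex> lock(nats_mutex_);

        if (nats_conn_ == nullptr) {
            // Drop the lock so a slow reconnect never stalls the other thread.
            lock.unlock();
            natsConnection* fresh = nullptr;
            natsConnection_ConnectTo(&fresh, nats_url_.c_str());

            // Re-acquire before swapping the shared handle.
            lock.lock();
            if (fresh != nullptr) {
                nats_conn_ = fresh;
            }
        }

        if (nats_conn_ == nullptr) {
            return false;  // still down; the worker retries on the next entry
        }
        return natsConnection_PublishString(nats_conn_, subject.c_str(),
                                            payload.c_str()) == NATS_OK;
    }
};

A production version would also have to avoid leaking a handle when two threads reconnect at the same time; the part worth copying is the unlock-reconnect-relock shape.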

The takeaway: once reconnections move off the main thread, sharing raw connection pointers without locks is a data race, and a data race is undefined behavior. The mutexes give me deterministic ownership and have eliminated the crash reports I saw in staging.

Dead-Man Switch That Actually Bites

Another oversight surfaced once the monitor thread was in place. During recovery I reset every health timestamp to now, unintentionally blinding the 10-minute dead-man switch. Permanent outages never tripped the failsafe, so Kubernetes kept the container alive forever.

Simply removing the timestamp reset brought the guard dog back online. Whenever the WebSocket, DB, or publisher stops moving for ten minutes, signalCriticalFailure() now fires, my main loop surfaces the error, and the container restarts. Real failure now provokes real action.
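
For reference, a dead-man switch of this kind fits in a few lines. signalCriticalFailure() and the ten-minute window are the real pieces; the timestamp members, the failure flag, and the 30-second poll interval are placeholders I added so the sketch stands on its own.

#include <atomic>
#include <chrono>
#include <thread>

struct Monitor {
    using Clock = std::chrono::steady_clock;

    // Advanced only by the threads doing real work, never by the recovery path.
    std::atomic<Clock::time_point> last_ws_msg_{Clock::now()};
    std::atomic<Clock::time_point> last_db_write_{Clock::now()};
    std::atomic<Clock::time_point> last_nats_pub_{Clock::now()};

    std::atomic<bool> running_{true};
    std::atomic<bool> critical_failure_{false};

    void signalCriticalFailure() { critical_failure_ = true; }  // main loop polls this flag

    void monitorLoop() {
        constexpr auto kDeadManWindow = std::chrono::minutes(10);
        while (running_) {
            const auto now = Clock::now();
            if (now - last_ws_msg_.load()   > kDeadManWindow ||
                now - last_db_write_.load() > kDeadManWindow ||
                now - last_nats_pub_.load() > kDeadManWindow) {
                signalCriticalFailure();  // main loop surfaces the error and exits
                return;                   // Kubernetes restarts the container
            }
            std::this_thread::sleep_for(std::chrono::seconds(30));
        }
    }
};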

Measuring the Win

Once the race conditions were gone, I focused on freshness. Every minute, the stale data checker reads the latest insert timestamp and compares it to wall-clock time. Before these fixes the deltas could spike well above 60 seconds; in the worst case the monitor itself caused back pressure.
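
The checker itself is a small probe that runs once a minute; a minimal libpq sketch of the idea looks like the snippet below. The table and column names (order_book_entries, inserted_at) and the connection string are illustrative, not my actual schema; the 300-second threshold matches the log excerpt that follows.

#include <libpq-fe.h>

#include <cstdio>
#include <cstdlib>

int main() {
    PGconn* conn = PQconnectdb("dbname=marketdata");  // illustrative connection string
    if (PQstatus(conn) != CONNECTION_OK) {
        std::fprintf(stderr, "connection failed: %s\n", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    // Age of the newest row, in whole seconds.
    PGresult* res = PQexec(conn,
        "SELECT COALESCE(EXTRACT(EPOCH FROM (now() - max(inserted_at)))::int, 999999) "
        "FROM order_book_entries");

    int rc = 0;
    if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) == 1) {
        const int delta = std::atoi(PQgetvalue(res, 0, 0));
        std::printf("Time difference: %d seconds (threshold: 300 seconds)\n", delta);
        if (delta > 300) {
            rc = 2;  // non-zero exit so a cron wrapper or alerter can react
        }
    }

    PQclear(res);
    PQfinish(conn);
    return rc;
}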

Here is an excerpt from the latest stale_data_check.log:

2025-10-27 11:03:01 - Time difference: 1 seconds (threshold: 300 seconds)
2025-10-27 11:04:01 - Time difference: 1 seconds (threshold: 300 seconds)
2025-10-27 11:05:01 - Time difference: 1 seconds (threshold: 300 seconds)
2025-10-27 11:06:01 - Time difference: 1 seconds (threshold: 300 seconds)

I now sit comfortably at ~1 second of staleness round the clock. That stability comes from eliminating double reconnects, keeping connections hot, and preventing the worker from retrying on corrupted handles. With deterministic access to Postgres and NATS, the writer thread stays in lockstep with incoming books.

What Comes Next

This round hardened the hot path, but I still want visibility into retries that only succeed after the first attempt. The next iteration will push counters into Prometheus so I can alert on reconnect storms before they hurt latency.
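
Nothing is wired up yet, so the snippet below is only the shape I have in mind, assuming prometheus-cpp; the library choice, the metric name, the labels, and the port are all provisional.

#include <prometheus/counter.h>
#include <prometheus/exposer.h>
#include <prometheus/registry.h>

#include <memory>

int main() {
    auto registry = std::make_shared<prometheus::Registry>();

    // One counter family, labelled per dependency, incremented from the reconnect paths.
    auto& reconnects = prometheus::BuildCounter()
                           .Name("ingestor_reconnects_total")
                           .Help("Reconnect attempts per dependency")
                           .Register(*registry);

    auto& db_reconnects   = reconnects.Add({{"dependency", "postgres"}});
    auto& nats_reconnects = reconnects.Add({{"dependency", "nats"}});

    // Scrape endpoint; Prometheus pulls from here and alerting keys off the rate.
    prometheus::Exposer exposer{"0.0.0.0:9100"};
    exposer.RegisterCollectable(registry);

    db_reconnects.Increment();    // would live in the Postgres reconnect path
    nats_reconnects.Increment();  // would live in connectToNats()
    return 0;
}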

For now, the ingestion service is both safer and faster: threads cooperate instead of fighting, the dead-man switch keeps its bite, and my stale-data delta stays pinned at one second. That is the kind of boring reliability you want from a market data pipeline. I also want to open source the ingestion engine on my GitHub, both to learn from others and to show my ongoing development, even though it is just one component of an algorithmic trading system.