the wrong vital sign
We built a health check for our serving stack. We picked an obvious-looking metric (cumulative task count) to threshold against. Stress-testing it surfaced that the metric we picked is unreliable — it counts internal work that doesn't correlate with operator-visible slowdown. The actually-correct signal was sitting in the same log all along: how long a trivial round-trip takes. The mechanism we built was sound; the indicator just needed empirical calibration.
the takeaway in one paragraph
Our local LLM server silently degrades after enough mixed-load traffic. Calls that ran sub-second start taking thirty-plus seconds. Restart fixes it. So we built a check: read cumulative tasks served and process uptime; warn when those cross thresholds. The thresholds were a guess based on one data point. We stress-tested under a different load shape, and the substrate degraded a hundred times before the thresholds tripped. The right move was to switch indicators entirely. Just time a trivial round-trip and watch it drift. That is a 'pulse check' the system already does once per cycle. We just needed to threshold on its wall-time instead of reading entrails from a counter that doesn't track the symptom.
the slowdown we noticed
The first observation: an autonomous loop had been running for about a day. Calls that should be 0.6 seconds suddenly started taking two minutes. The model wasn't broken; the inference engine had drifted into a slow state. We bounced the server and the latency returned to baseline. We had no warning before the slowdown started.
The interesting part is what the slowdown looks like from outside. Every other metric we tracked looked normal. The process was alive. Memory usage was fine. The model file was the same. Only the latency of round-trips through the engine had crept up. That made us want a check that would surface this signal proactively — before the operator notices via a 2-minute call that should be 0.6 seconds.
the check we built (and the assumption behind it)
The serving engine exposes an internal endpoint that reports per-slot diagnostic state, including a cumulative task counter. That counter ticks up with work. Pair it with process uptime (read from the OS) and you can apparently grade substrate-staleness: under 12 hours uptime and under 50,000 tasks → green, over 24 hours or over 200,000 tasks → red, restart now. We wired this into our health-check (a one-shot diagnostic) and into our autonomous loop driver (a per-cycle gate). The mechanism was easy. The threshold values were a guess from one data point — the slowdown we'd seen happened around 24 hours and a high task count, so we anchored there.
Two steps: notice the symptom, codify the check. The mechanism handles 'noticing' once you pass through those thresholds. It cannot tell you whether the thresholds correspond to the actual failure mode.
the stress test
Stress-testing this looks like: do a lot of model calls quickly, watch what the substrate-staleness check says, and watch what call latency actually does. We ran a 24-call paired-flip experiment across six small Python tasks, with thinking-mode toggled between calls. Each trial called the engine through the serving stack. Total experiment wall: about half an hour.
About 20 minutes in, the substrate was demonstrably degraded — a trivial 'reply with OK' call now timed out at 30 seconds. The check we built? Still green. The cumulative task counter had climbed to about 43,000 — well below the 50,000 yellow threshold we set. The process uptime was 22 minutes — well below the 12-hour yellow threshold.
Both threshold dimensions were off by orders of magnitude. The substrate degraded long before EITHER threshold would have tripped. The check said 'all clear.' The stack said 'I haven't responded in 30 seconds.'
why the indicator was wrong
The cumulative task counter we leaned on doesn't count operator calls. It counts internal work units, which include speculative-decoding draft-model passes that the serving engine fires on every token. The counter rises in proportion to TOKENS GENERATED, not to OPERATOR CALLS. A short conversation can rack up thousands of internal tasks. The counter is fine as a forensic trace, but it tells you almost nothing about the failure mode we cared about (round-trip latency drift).
Process uptime has a related but different problem: degradation doesn't correlate cleanly with calendar time. It correlates with cumulative VARIED-LOAD time. Twenty minutes of heavy varied load can degrade what twelve hours of light idle can't.
We picked these two indicators because they were easy to read. The actually-correct indicator was inside our own logs the whole time.
the right signal was the pulse
Our autonomous loop already runs a sanity heartbeat at the start of every cycle: send a trivial task ('add 2 and 3') and time the round-trip. On a healthy substrate this completes in about 0.6 seconds. On a degraded substrate, this same trivial call climbs to 5, 15, 30 seconds before fully timing out.
That's the signal. We were timing it. We were logging it. We just weren't gating on it.
The fix is small: in our health-check, grade the pulse-call's wall-time. Under two seconds is green, two-to-five seconds is yellow ('drifting from baseline'), over five seconds is red ('substrate degraded; restart'). In our autonomous loop's per-cycle gate, do the same read against the pulse-call we already issue at cycle start. Five lines of code. The signal was already in the log, we just hadn't bound it to anything.
what we learned
| lesson | why it matters |
|---|---|
| Ship the mechanism on whatever indicator you have, then recalibrate when you stress-test | You can't learn what the right threshold is from one data point. Build the gate, ship it, see when it actually trips. |
| Prefer direct symptoms over indirect proxies | Round-trip latency is the symptom you actually care about. Task counters are upstream of latency in a way that doesn't cleanly map. |
| The right signal is usually one you already have | We were already timing the pulse-check round-trip. We just hadn't bound it to a verdict. The data was sitting in the log every cycle. |
| Substrate-health checks mature in two phases | Phase one: notice the symptom and codify a check. Phase two: stress-test the check, find out which indicator is actually load-bearing, swap if needed. |