Incident retrospective: the cache stampede of April 14
Summary
On April 14, between 14:02 and 14:38 UTC, the homepage recommendations endpoint served 5xx for 36 minutes. Customer impact: roughly 12% of homepage requests during the window. No data loss.
Timeline
| Time (UTC) | Event |
|---|---|
| 14:02 | First 5xx alert. On-call paged. |
| 14:04 | Engineer acknowledges. Initial hypothesis: deploy at 13:58. |
| 14:11 | Reverted the deploy. No change. |
| 14:14 | Engineer notices recommendations service CPU at 100% across all 16 pods. |
| 14:19 | Doubled pod count. Latency improves slightly. Error rate still climbing. |
| 14:23 | Found a Redis key with no TTL. It was 14 GB. |
| 14:25 | Manually deleted the key. Service recovers within 90 seconds. |
| 14:38 | All-clear declared. |
What actually happened
A cron job had been writing a single Redis key as part of a feature flag rollout two months earlier. The key had no TTL. As it grew past 10 GB, the Redis node started doing synchronous keyspace scans during the recommendations service’s read path. The deploy at 13:58 was a red herring — it just happened to be the trigger that pushed the read rate over the threshold.
What went well
- The dashboard was good. Three different views (CPU, latency, error rate) on one screen, all using the same time range. We didn’t waste time aligning axes.
- The revert was fast. Two minutes from “this looks like the deploy” to “revert shipped.” Even though it wasn’t the cause, it ruled out a class of failure and bought us time.
- The postmortem is blameless. Everyone involved had done exactly the reasonable thing at each step. The bug was a missing TTL, not a missing human.
What didn’t
- The alert fired late. The 5xx alert triggered when error rate crossed 5%. By then, customers had been impacted for ~30 seconds. We could have detected this from the Redis memory growth a week earlier.
- The on-call had to do too much manual work. Finding the big key
required
redis-cli --bigkeys, which isn’t in the standard runbook. The post-incident action is to add a “list top 10 keys by size” panel to the Redis dashboard. - I didn’t loop in the right person fast enough. The owner of the feature that wrote the key was on the team. They had context I didn’t — including a hunch, eventually confirmed, that the cron job was the culprit. I should have pulled them in at 14:11, not 14:30.
Action items
- Add Redis memory-by-key panel to the dashboard (owner: TBD, due: this week).
- Write a CI check that fails any PR introducing
SETwithoutEX(owner: me, due: this month). - Add a “10% of pods at 100% CPU” alert that’s noisier than the 5xx alert but fires earlier (owner: TBD, due: this month).
- Update the runbook to include the
redis-cli --bigkeyscommand (owner: me, due: today).