Incident retrospective: the cache stampede of April 14

Summary

On April 14, between 14:02 and 14:38 UTC, the homepage recommendations endpoint served 5xx for 36 minutes. Customer impact: roughly 12% of homepage requests during the window. No data loss.

Timeline

Time (UTC)  Event
14:02       First 5xx alert. On-call paged.
14:04       Engineer acknowledges. Initial hypothesis: deploy at 13:58.
14:11       Reverted the deploy. No change.
14:14       Engineer notices recommendations service CPU at 100% across all 16 pods.
14:19       Doubled pod count. Latency improves slightly. Error rate still climbing.
14:23       Found a Redis key with no TTL. It was 14 GB.
14:25       Manually deleted the key. Service recovers within 90 seconds.
14:38       All-clear declared.

What actually happened

Two months earlier, as part of a feature flag rollout, a cron job started writing to a single Redis key. The key had no TTL, so it never expired. Once it grew past 10 GB, the Redis node started doing synchronous keyspace scans during the recommendations service’s read path. The deploy at 13:58 was a red herring: it was not the cause, it just happened to push the read rate over the threshold.
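
For illustration, here is a minimal sketch of how a key like this could be surfaced before it causes trouble: walk the keyspace incrementally and report the largest keys, flagging any with no expiry. It is written in Python against a client object that is assumed to expose redis-py style scan_iter, memory_usage, and ttl methods. The function and type names are hypothetical, not part of our tooling.

```python
from typing import Any, List, NamedTuple


class KeyReport(NamedTuple):
    key: str
    size_bytes: int
    has_ttl: bool


def top_keys_by_size(client: Any, limit: int = 10, scan_count: int = 1000) -> List[KeyReport]:
    """Return the `limit` largest keys, flagging those with no expiry.

    Assumption: `client` behaves like redis-py's `redis.Redis`, i.e. it has
    `scan_iter(count=...)`, `memory_usage(key)` and `ttl(key)`.
    """
    reports = []
    # scan_iter uses SCAN under the hood, so the keyspace is walked in batches.
    for raw_key in client.scan_iter(count=scan_count):
        size = client.memory_usage(raw_key)
        ttl = client.ttl(raw_key)
        # memory_usage is None and ttl is -2 when the key vanished mid-scan.
        if size is None or ttl == -2:
            continue
        key = raw_key.decode() if isinstance(raw_key, bytes) else str(raw_key)
        # ttl == -1 means the key exists but has no expiry set.
        reports.append(KeyReport(key, size, ttl != -1))
    reports.sort(key=lambda report: report.size_bytes, reverse=True)
    return reports[:limit]
```

With redis-py, passing a redis.Redis instance as client should work. The scan issues one MEMORY USAGE and one TTL call per key, so a replica is a safer target than the primary.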

What went well

  • The dashboard was good. Three different views (CPU, latency, error rate) on one screen, all using the same time range. We didn’t waste time aligning axes.
  • The revert was fast. Two minutes from “this looks like the deploy” to “revert shipped.” Even though it wasn’t the cause, it ruled out a class of failure and bought us time.
  • The postmortem is blameless. Everyone involved had done exactly the reasonable thing at each step. The bug was a missing TTL, not a missing human.

What didn’t

  • The alert fired late. The 5xx alert triggered when error rate crossed 5%. By then, customers had been impacted for ~30 seconds. We could have detected this from the Redis memory growth a week earlier.
  • The on-call had to do too much manual work. Finding the big key required redis-cli --bigkeys, which isn’t in the standard runbook. The post-incident action is to add a “list top 10 keys by size” panel to the Redis dashboard.
  • I didn’t loop in the right person fast enough. The owner of the feature that wrote the key was on the team. They had context I didn’t — including a hunch, eventually confirmed, that the cron job was the culprit. I should have pulled them in at 14:11, not 14:30.
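
One candidate for a signal that fires earlier than the 5xx alert is the fraction of pods pinned at full CPU, which is what the engineer spotted by hand at 14:14. A minimal sketch in Python, assuming per-pod CPU utilisation is already available as a number between 0.0 and 1.0. The function names and the 0.99 cutoff standing in for "100%" are assumptions, and in practice this logic would more likely live in the alerting system's own rule language than in application code.

```python
from typing import Mapping


def saturated_fraction(cpu_by_pod: Mapping[str, float], saturation: float = 0.99) -> float:
    """Fraction of pods whose CPU utilisation is at or above `saturation`.

    Assumption: utilisation is normalised to the range 0.0 to 1.0 per pod.
    """
    if not cpu_by_pod:
        return 0.0
    saturated = sum(1 for usage in cpu_by_pod.values() if usage >= saturation)
    return saturated / len(cpu_by_pod)


def should_alert(
    cpu_by_pod: Mapping[str, float],
    min_fraction: float = 0.10,
    saturation: float = 0.99,
) -> bool:
    """Fire once at least `min_fraction` of pods are saturated."""
    return saturated_fraction(cpu_by_pod, saturation) >= min_fraction
```

With 16 pods, this fires once two are saturated (2/16 = 12.5%), while a single hot pod (6.25%) stays quiet.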

Action items

  1. Add Redis memory-by-key panel to the dashboard (owner: TBD, due: this week).
  2. Write a CI check that fails any PR introducing SET without EX (owner: me, due: this month).
  3. Add a “10% of pods at 100% CPU” alert that’s noisier than the 5xx alert but fires earlier (owner: TBD, due: this month).
  4. Update the runbook to include the redis-cli --bigkeys command (owner: me, due: today).
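
Action item 2 could start life as a diff scanner. A minimal sketch in Python, assuming call sites use redis-py style client.set(...) calls written on a single line, and that client variables have "redis" or "cache" somewhere in their name. Both regexes and the function name are hypothetical and heuristic, so expect some false positives and some misses.

```python
import re
from typing import List, Tuple

# Assumption: Redis clients are named something containing "redis" or "cache".
SET_CALL = re.compile(r"\b\w*(?:redis|cache)\w*\.set\(", re.IGNORECASE)
# Keyword arguments of redis-py's set() that attach an expiry to the key.
EXPIRY_ARG = re.compile(r"\b(?:ex|px|exat|pxat)\s*=")


def find_set_without_expiry(diff_text: str) -> List[Tuple[int, str]]:
    """Return (diff_line_number, code) for added lines that SET with no expiry.

    `diff_text` is expected to be a unified diff. Only lines the change adds
    are inspected; removed lines and context lines are ignored.
    """
    violations = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        # Added lines start with "+"; "+++" is the file header, not code.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        code = line[1:].strip()
        if code.startswith("#"):  # skip commented-out Python code
            continue
        if SET_CALL.search(code) and not EXPIRY_ARG.search(code):
            violations.append((number, code))
    return violations
```

In CI, this could be fed the output of git diff against the base branch, failing the build whenever the returned list is non-empty. Multi-line calls and expiries set separately through EXPIRE would need a smarter check than this.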
