How we cut p99 latency by 80% with one trace
We’d been chasing p99 latency on the checkout service for two sprints. The dashboard showed the same red bar week after week. We tried the usual things — bigger connection pools, faster JSON encoders, a new Protobuf schema. None of it moved the number.
Then an engineer added one OpenTelemetry span around a payment-provider call and the picture changed.
The trace that made it obvious
HTTP POST /checkout             1.20s
├── validate cart               0.01s
├── charge payment              1.18s
│   ├── build request body      0.01s
│   ├── HTTP POST provider      1.15s  ◄── all of it
│   └── parse response          0.02s
└── persist order               0.01s
Almost the entire request, 1.15 of 1.20 seconds, was spent waiting for the upstream provider. The frustrating part was that most requests were already fine: ninety percent returned in 80ms. The p99 was being driven by a thin tail of slow requests, and the only thing bounding that tail was our HTTP client timeout.
The fix
We had set the timeout to 2 seconds. The provider’s actual p99 was 1.1 seconds. So a small fraction of legitimate slow requests were bumping into our timeout, getting retried internally, and only finally surfacing as “timeouts” once we’d already doubled the latency.
The fix was a five-line change to the HTTP client config: lower the timeout to 1.2 seconds, fail fast, and let the caller decide whether to retry.
    httpClient := &http.Client{
        Timeout:   1200 * time.Millisecond, // was 2 * time.Second
        Transport: otelhttp.NewTransport(http.DefaultTransport),
    }
P99 dropped from 1.2s to 220ms within the hour.
What I took from it
- A trace beats a dashboard. Dashboards show you the symptom; a trace shows you the path. We had been optimising the wrong layer.
- The slowest span is almost always a network boundary. If your service is fast in isolation and slow in production, look at the spans that cross your process boundary.
- p99 is a tail, not an average. When you fix the tail, you usually find that you weren’t really fixing an average to begin with — you were hiding a class of failures behind it.