How we cut p99 latency by 80% with one trace

We’d been chasing p99 latency on the checkout service for two sprints. The dashboard showed the same red bar week after week. We tried the usual things — bigger connection pools, faster JSON encoders, a new Protobuf schema. None of it moved the number.

Then an engineer added one OpenTelemetry span around a payment-provider call and the picture changed.

The trace that made it obvious

HTTP POST /checkout                 1.20s
├── validate cart                   0.01s
├── charge payment                  1.18s
│   ├── build request body          0.01s
│   ├── HTTP POST provider          1.15s  ◄── all of it
│   └── parse response              0.02s
└── persist order                   0.01s

The entire 1.15 seconds was waiting for the upstream provider — and the worst part was that we were already doing the right thing most of the time. Ninety percent of requests returned in 80ms. The p99 was being driven by a thin tail of slow ones, and the tail was bounded by our HTTP client timeout.

The fix

We had set the timeout to 2 seconds. The provider’s actual p99 was 1.1 seconds. So a small fraction of legitimate slow requests were bumping into our timeout, getting retried internally, and only finally surfacing as “timeouts” once we’d already doubled the latency.

The fix was a five-line change to the HTTP client config: lower the timeout to 1.2 seconds, fail fast, and let the caller decide whether to retry.

httpClient := &http.Client{
    Timeout: 1200 * time.Millisecond, // was 2 * time.Second
    Transport: otelhttp.NewTransport(http.DefaultTransport),
}

P99 dropped from 1.2s to 220ms within the hour.

What I took from it

A trace beats a dashboard. Dashboards show you the symptom; a trace shows you the path. We had been optimising the wrong layer.
The slowest span is almost always a network boundary. If your service is fast in isolation and slow in production, look at the spans that cross your process.
p99 is a tail, not an average. When you fix the tail, you usually find that you weren’t really fixing an average to begin with — you were hiding a class of failures behind it.