This post is a confession about multiple overlapping mistakes we made when trying to measure API request times. These errors went unnoticed for months because wrong statistics still look realistic. Hopefully, this post will help you avoid such fails.
What we are trying to do
We are making multiple parallel API requests to the external service. These are similar requests to the same service but with different input. We want to measure the time of each API call and show the statistics on a dashboard.
How we show data
Running a lot of API requests in parallel means we have hundreds of data items each minute. We can't show them all it a graph, so we need to perform some aggregation to extract meaning from data.
Fail: aggregating durations with average function
Usually, when we want to aggregate multiple values into one, we think about the
avg function. And usually that's a bad idea because
avg hides a lot of information.
A good example of this is Anscombe's quartet. It's a group of four very different data sets that have the same descriptive statistics like average values and variances of
y or correlations between them.
Anscombe's quartet illustration from Wikipedia
The better way of aggregating a bunch of durations is percentiles. Percentile shows the duration in which some percentage of durations fit. For example, 90th percentile of
600ms shows that 90% of requests finish within 600 milliseconds. 50th percentile is median value. 100th percentile is maximum value.
We can show multiple values on our graph: average, 50th percentile, 90th percentile, 95th percentile, 100th percentile. That would give us better understanding on what's happening.
Fail: not handling periods with no data
How would our graph look when we suddenly stop making API requests and sending metrics? That depends on the graph configuration in Grafana.
It might duplicate old values to make sure there's no gap. Such a graph would distort reality. Instead, we want to ensure the graph lines fall to zero or terminate so we can see if something is wrong. For this, we set the value of the "Null value" panel option to something else from
How we store data
We record hundreds of data points to Prometheus, but it does not store them all. Instead, it uses a histogram metric type. It stores several buckets for different value ranges and tracks the occurrence frequency for each bucket.
For example, if our bucket configuration looks like
1s, 3s, 5s, that would mean we have four buckets:
>=5s. When we record
2.5s, that counts as one more occurrence for the
Such data structure greatly reduces the amount of data we need to store, but that hits data granularity. In histogram both durations
1s002ms would looks like
3s. We must keep that in mind. Otherwise, you would, like me, wonder why we have 7-second durations when our timeout is 5 seconds.
Fail: using default histogram buckets
It makes sense to check the buckets in Prometheus to make sure they match your value range.
In our case they didn't. This could be fixed with Kamon config.
How we measure data
Well, this one is a bit awkward…
We take a list of ids, iterate it in parallel with API requests, and then collect everything back into a list of responses. If we put the measuring code outside of the iteration, it would log some data, but it would log a single value (the biggest one) instead of hundreds.
Fail: measuring time outside the loop
Remember this "we have hundreds of data items"? Turns out this assumption was wrong.
In this case Grafana is lying in a fascinating way: when we have a single value, the 80% percentile would show this value multiplied by 80%. So you would still see different values for each percentile, but they would look strangely correlated.
The common issue with these fails is that it is not obvious that something is wrong. You can still see some data in a graph and it looks realistic. So be careful when you're working with reporting logic.
And sometimes, when the metric says that the tool is broken, actually the metric is broken.