Life of a Metric
A metric provides value when it is aggregated: an individual measurement is entirely absorbed into the aggregation and loses its identity in the process.
Each observability platform behaves a little differently around where this aggregation happens. With Prometheus specifically, the aggregation happens both in the client library and in Prometheus itself.
Consider a simple metric that I want my service to report on: the time taken to respond to an HTTP request. My setup for this flow looks like this:
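On the service side there is nothing special. A minimal sketch of the kind of endpoint being timed might look like the following; the handler body is made up purely for illustration, only the /caller/messages path matches the uri label that shows up later:

import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
class CallerController {

    // With spring-boot-starter-actuator and Micrometer on the classpath, every request
    // handled here is timed automatically and reported under http.server.requests
    @PostMapping("/caller/messages")
    String messages(@RequestBody String message) {
        return "received: " + message;
    }
}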
If I were to now send one HTTP request to the service, a look at the scrape endpoint exposed by the service shows this detail:
http_server_requests_seconds_count{application="sample-caller",method="POST",status="200",uri="/caller/messages",} 1.0
http_server_requests_seconds_sum{application="sample-caller",method="POST",status="200",uri="/caller/messages",} 0.724265875
As a quick aside, the scrape endpoint is an HTTP endpoint exposed by every service that expects to be monitored by Prometheus; Prometheus periodically polls this endpoint for the latest metrics that the service has collected.
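To see what such an endpoint serves without running a full application, here is a small standalone sketch (not from the sample project) that uses Micrometer's PrometheusMeterRegistry directly and prints the same text exposition format; the exact package of the registry varies a little across Micrometer versions:

import io.micrometer.core.instrument.Timer;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

import java.time.Duration;

public class ScrapeDemo {
    public static void main(String[] args) {
        // A registry that can render its contents in the Prometheus text exposition format
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

        // Record one fake request latency against a timer
        Timer timer = Timer.builder("http.server.requests")
                .tag("uri", "/caller/messages")
                .register(registry);
        timer.record(Duration.ofMillis(724));

        // This string is essentially what the scrape endpoint returns to Prometheus
        System.out.println(registry.scrape());
    }
}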
My latency metric was called http_server_requests_seconds, but two values with different suffixes, _count and _sum, have shown up in the scrape endpoint. This is the first level of aggregation, and it is typically performed by the instrumentation inside the application itself:
- http_server_requests_seconds_count is the count of requests since the service started, and will only go up.
- http_server_requests_seconds_sum is the sum of the latencies since the service started, and again will only go up.
The individual metric is no more, though; it has been translated into these two aggregated values. So after 9 more requests, the metric numbers look like this:
http_server_requests_seconds_count{application="caller",method="POST",outcome="SUCCESS",status="200",uri="/caller/messages",} 10.0
http_server_requests_seconds_sum{application="caller",method="POST",outcome="SUCCESS",status="200",uri="/caller/messages",} 2.074026123
More Aggregations
Once the data ends up in an observability system like Prometheus, the sum and the count by themselves are still not that useful; they are just aggregated numbers that will only ever increase.
What is useful is the delta over a period of time (the slope of this ever-increasing counter). In Prometheus this is done using the rate function, which returns the per-second rate of increase over a window; for example, a counter that rose from 100 to 160 over a one-minute window has a rate of 1 request per second. The query looks like this:
rate(http_server_requests_seconds_count{application="caller", uri="/caller/messages",method="POST"}[1m])
And for a slightly more complex one, the average response time can be calculated with an expression that combines _sum and _count; dividing the per-second rate of accumulated latency by the per-second rate of requests gives the average latency per request over the window:
sum(rate(http_server_requests_seconds_sum{application="caller",uri="/caller/messages"}[1m]))
  /
sum(rate(http_server_requests_seconds_count{application="caller",uri="/caller/messages"}[1m]))
Even More Aggregations
An important calculation for latency is getting percentile numbers. I would love to know the 95th and 99th percentile latency for my sample request. Given that Prometheus does not receive the individual datapoints and only has aggregations at different points in time, it is impossible to calculate percentiles from just the aggregated latency data. Instead, to calculate percentiles, Prometheus provides another metric type: the histogram.
The way a histogram works is very similar to a counter, except that the counts are bucketed into different latency ranges (the choice of latency ranges is again left to the client library; Micrometer's logic is described here: https://docs.micrometer.io/micrometer/reference/concepts/histogram-quantiles.html).
So for my sample, which uses the excellent Micrometer library, a histogram metric can be emitted easily; all that needs to be done for a Spring Boot application using Micrometer is to set this property to true:
management.metrics.distribution.percentiles-histogram.http.server.requests: true
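As an aside, for a hand-rolled timer (rather than the auto-instrumented http.server.requests), the equivalent of this property is an opt-in on the Micrometer builder; the metric name below is purely illustrative:

import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

SimpleMeterRegistry registry = new SimpleMeterRegistry();

// Publishes the bucketed counters needed for server-side percentile approximation
Timer timer = Timer.builder("my.custom.timer")
        .publishPercentileHistogram()
        .register(registry);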
With the property set, the following bucketed counters start showing up in the scrape endpoint:
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.089478485",} 11.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.111848106",} 16.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.134217727",} 17.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.156587348",} 17.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.178956969",} 18.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.20132659",} 18.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.223696211",} 18.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.246065832",} 18.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.268435456",} 18.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.357913941",} 30.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.447392426",} 32.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.536870911",} 33.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.626349396",} 33.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.715827881",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.805306366",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.894784851",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.984263336",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="1.073741824",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="1.431655765",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="1.789569706",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="2.147483647",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="2.505397588",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="2.863311529",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="3.22122547",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="3.579139411",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="3.937053352",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="4.294967296",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="5.726623061",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="7.158278826",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="8.589934591",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="10.021590356",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="11.453246121",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="12.884901886",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="14.316557651",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="15.748213416",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="17.179869184",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="22.906492245",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="28.633115306",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="30.0",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="+Inf",} 34.0
Looking at one of these buckets:
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.134217727",} 17.0
All it means is that there have been 17 requests that took less than or equal to ~0.134 seconds. Now, to get a percentile approximation based on this data, the PromQL query uses the histogram_quantile function and looks like this:
histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket{ application="caller", status!~"5..", uri="/caller/messages"}[1m])) by (le))
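To get an intuition for what histogram_quantile is doing, here is a rough back-of-the-envelope version of the calculation against the raw bucket counts shown above (the real query operates on rate()-d buckets over a window, but the interpolation idea is the same):

public class QuantileIntuition {
    public static void main(String[] args) {
        // The 99th percentile of 34 observations has rank 0.99 * 34 = 33.66, which falls in the
        // (0.626349396, 0.715827881] bucket, where the cumulative count goes from 33 to 34
        double rank = 0.99 * 34;          // 33.66
        double lowerBound = 0.626349396;  // 33 observations at or below this boundary
        double upperBound = 0.715827881;  // 34 observations at or below this boundary
        double p99 = lowerBound + (upperBound - lowerBound) * (rank - 33) / (34 - 33);

        // ~0.685s, assuming observations are spread evenly within the bucket
        System.out.println(p99);
    }
}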
Using this, here is a chart of the average and the 99th percentile latency plotted over time:
Conclusion
A metric, once ingested into an observability platform like Prometheus, loses its individuality and is aggregated into other forms. These aggregated numbers can then be queried in different ways to provide actionable data; in the examples in this post, to get the rate of requests over time, to calculate the average response time, and to approximate percentile values.
References:
- Details of the histogram metric type in Prometheus: https://prometheus.io/docs/concepts/metric_types/#histogram
- An excellent post on how percentile-based bucketing works for Spring Boot: https://coderstower.com/2022/05/30/spring-boot-observability-validating-tail-latency-with-percentiles/