Life of a Metric

Biju Kunjummen
4 min read · May 25, 2024


A metric provides value when it is aggregated — an individual metric is entirely absorbed in the aggregation and loses its identity in the process.

Each observability platform behaves a little differently around where aggregation happens. With Prometheus specifically, aggregation happens both in the client library and in Prometheus itself.

Consider a simple metric that I want my service to report on: the time taken to respond to an HTTP request. My setup for this flow is a service instrumented with Micrometer, with Prometheus scraping it.

If I were to now send one HTTP request to the service, a look at the scrape endpoint exposed by the service shows this detail:

http_server_requests_seconds_count{application="sample-caller",method="POST",status="200",uri="/caller/messages",} 1.0
http_server_requests_seconds_sum{application="sample-caller",method="POST",status="200",uri="/caller/messages",} 0.724265875

As a quick aside, a scrape endpoint is an HTTP endpoint exposed by every service that expects to be monitored by Prometheus; Prometheus periodically polls this endpoint for the latest metrics the service has collected.
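Each line on the scrape endpoint follows the Prometheus text exposition format: a metric name, a set of labels in braces, and a value. As a minimal sketch (the real format also carries comments, timestamps, and escaping rules that this ignores), such a line can be pulled apart like this:

```python
def parse_sample(line):
    """Split one exposition-format line into (name, labels, value).
    Sketch only; real parsers handle comments, timestamps, escaping."""
    name_and_labels, value = line.rsplit(" ", 1)
    name, _, rest = name_and_labels.partition("{")
    labels = {}
    # The label block ends with a trailing comma and brace, e.g. ...,}
    for pair in rest.rstrip("},").split('",'):
        if pair:
            key, _, val = pair.partition('="')
            labels[key] = val.rstrip('"')
    return name, labels, float(value)

name, labels, value = parse_sample(
    'http_server_requests_seconds_count{application="sample-caller",'
    'method="POST",status="200",uri="/caller/messages",} 1.0')
print(name, labels["method"], value)
```

Running this against the first sample line above yields the metric name, the `method="POST"` label, and the value `1.0`.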

My latency metric was called http_server_requests_seconds, but two values with the suffixes _count and _sum have shown up in the scrape endpoint. This is the first level of aggregation, and it is typically performed by the instrumentation inside the application itself:

http_server_requests_seconds_count is the count of requests since the service started, and will only go up.

http_server_requests_seconds_sum is the sum of latencies since the service started, and again this value will only keep going up.

The individual metric no longer exists, though; it has been folded into these two aggregated values. After 9 more requests, the metric numbers look like this:

http_server_requests_seconds_count{application="caller",method="POST",outcome="SUCCESS",status="200",uri="/caller/messages",} 10.0
http_server_requests_seconds_sum{application="caller",method="POST",outcome="SUCCESS",status="200",uri="/caller/messages",} 2.074026123
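The client-side aggregation behind these two values can be sketched in a few lines. This is a hypothetical minimal timer, not Micrometer's actual implementation: each observation is absorbed into the running count and sum, and the individual value is then gone.

```python
class TimerSketch:
    """Minimal sketch of client-side timer aggregation."""
    def __init__(self):
        self.count = 0
        self.sum = 0.0

    def record(self, seconds):
        # The observation is folded into the aggregates and discarded;
        # only the running count and sum survive to the next scrape.
        self.count += 1
        self.sum += seconds

timer = TimerSketch()
for latency in [0.72, 0.15, 0.21]:   # illustrative request latencies
    timer.record(latency)

print(timer.count)   # -> http_server_requests_seconds_count
print(timer.sum)     # -> http_server_requests_seconds_sum
```

After three recorded requests, the scrape would report a count of 3 and a sum of roughly 1.08 seconds, with no way to recover the three original latencies.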

More Aggregations

Once the data ends up in an observability system like Prometheus, the sum and count by themselves are still not that useful: each is just an aggregated number that always increases.

What is useful is the delta over a period of time (the slope of the ever-increasing counter). In Prometheus this is computed with the rate function and looks like this:

rate(http_server_requests_seconds_count{application="caller", uri="/caller/messages",method="POST"}[1m])
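At its core, rate divides the increase of the counter by the length of the time window. A simplified sketch over hypothetical scrape samples (Prometheus additionally extrapolates to the window boundaries and handles counter resets, which this ignores):

```python
# Hypothetical scrapes of a counter: (timestamp in seconds, value).
samples = [(0, 100.0), (15, 130.0), (30, 175.0), (45, 220.0), (60, 280.0)]

def simple_rate(samples):
    """Per-second increase between the first and last sample in the
    window; a simplification of Prometheus's rate()."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

print(simple_rate(samples))  # 3.0 requests per second
```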

And a slightly more complex one: to calculate the average response time, an expression that combines _sum and _count can be used:

sum(rate(http_server_requests_seconds_sum{application="caller",uri="/caller/messages"}[1m]))/sum(rate(http_server_requests_seconds_count{application="caller",uri="/caller/messages"}[1m]))
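The reason this expression yields the average response time: dividing the rate of _sum by the rate of _count cancels the window length, leaving latency added per request added. A sketch with hypothetical counter values at the ends of a 60-second window:

```python
# Hypothetical counter values at the start and end of a 60s window.
sum_start, sum_end = 10.0, 22.0       # http_server_requests_seconds_sum
count_start, count_end = 100, 160     # http_server_requests_seconds_count

window = 60
rate_sum = (sum_end - sum_start) / window      # latency added per second
rate_count = (count_end - count_start) / window  # requests per second

# The window length cancels: total latency added / requests added.
avg = rate_sum / rate_count
print(avg)  # 0.2 seconds per request
```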

Even More Aggregations

An important calculation for latency is getting percentile numbers. I would love to know the 95th and 99th percentile latencies for my sample request. However, since Prometheus never sees individual data points, only aggregations at different points in time, it is impossible to calculate percentiles from the aggregated latency data alone. Instead, Prometheus provides another metric type for this: the histogram.

The way the histogram works is very similar to a counter, except that the counts are bucketed into different latency ranges. The choice of ranges is again left to the client library; Micrometer's logic is described here: https://docs.micrometer.io/micrometer/reference/concepts/histogram-quantiles.html

So for my sample, which uses the excellent Micrometer library, the histogram metric can be emitted easily. All that needs to be done for a Spring Boot application using Micrometer is to set this property to true:

management.metrics.distribution.percentiles-histogram.http.server.requests: true

With this set, the following bucketed counters start showing up in the scrape endpoint:

http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.089478485",} 11.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.111848106",} 16.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.134217727",} 17.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.156587348",} 17.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.178956969",} 18.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.20132659",} 18.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.223696211",} 18.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.246065832",} 18.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.268435456",} 18.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.357913941",} 30.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.447392426",} 32.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.536870911",} 33.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.626349396",} 33.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.715827881",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.805306366",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.894784851",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.984263336",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="1.073741824",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="1.431655765",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="1.789569706",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="2.147483647",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="2.505397588",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="2.863311529",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="3.22122547",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="3.579139411",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="3.937053352",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="4.294967296",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="5.726623061",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="7.158278826",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="8.589934591",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="10.021590356",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="11.453246121",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="12.884901886",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="14.316557651",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="15.748213416",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="17.179869184",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="22.906492245",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="28.633115306",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="30.0",} 34.0
http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="+Inf",} 34.0

Looking at one of these buckets:


http_server_requests_seconds_bucket{application="caller",status="200",uri="/caller/messages",le="0.134217727",} 17.0

All it means is that 17 requests took less than or equal to 0.134 seconds. Now, to get a percentile approximation based on this data, the PromQL function to use looks like this:

histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket{ application="caller", status!~"5..", uri="/caller/messages"}[1m])) by (le))
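Under the hood, histogram_quantile finds the bucket containing the target rank and interpolates linearly within it. A minimal sketch over a few of the cumulative bucket counts shown above (simplified; the real function operates on rates and has extra edge-case handling for the lowest and +Inf buckets):

```python
# Cumulative bucket counts, keyed by upper bound (le), mirroring a
# subset of the scrape output above.
buckets = [(0.089, 11), (0.134, 17), (0.268, 18), (0.358, 30),
           (0.537, 33), (0.716, 34), (float("inf"), 34)]

def quantile(q, buckets):
    """Sketch of histogram_quantile: locate the bucket holding the
    q-th rank, then interpolate linearly between its bounds."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count

print(quantile(0.50, buckets))
print(quantile(0.99, buckets))
```

Because only bucket boundaries are known, the result is an approximation: the median here resolves to a bucket boundary (0.134s), while the 99th percentile is interpolated inside the 0.537–0.716s bucket. Tighter buckets give tighter estimates.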

Using these queries, the average and 99th percentile latencies can be charted over time.

Conclusion

A metric, once ingested into an observability platform like Prometheus, loses its individuality and is aggregated into other forms. These aggregated numbers can then be queried in different ways to provide actionable data: in this post's examples, to get the rate of requests over time and to calculate percentile values.

