Metric types in Prometheus

From a storage perspective, all metrics are the same, but they behave differently in different scenarios. For example, the metric node_load1 in the samples returned by Node Exporter reflects the current system load, so its value rises and falls over time. The metric node_cpu is different: it reflects the cumulative CPU time used, so its value increases continuously. In theory, as long as the system is not shut down, this value keeps growing without bound.

To help users understand and distinguish these metrics, Prometheus defines four metric types: Counter, Gauge, Histogram, and Summary.

The sample data returned by an Exporter also includes the metric type in a comment. For example:

# HELP node_cpu Seconds the cpus spent in each mode.
# TYPE node_cpu counter
node_cpu{cpu="cpu0",mode="idle"} 362812.7890625
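The type comment can be read back out of the text with a few lines of code. The following is a minimal sketch in plain Python, not the parser Prometheus itself uses; the function name parse_metric_types is a hypothetical helper introduced only for illustration.

```python
def parse_metric_types(exposition_text):
    """Return {metric_name: metric_type} from '# TYPE' comment lines."""
    types = {}
    for line in exposition_text.splitlines():
        parts = line.strip().split()
        # A type comment has the shape: # TYPE <metric_name> <type>
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            types[parts[2]] = parts[3]
    return types


sample = """\
# HELP node_cpu Seconds the cpus spent in each mode.
# TYPE node_cpu counter
node_cpu{cpu="cpu0",mode="idle"} 362812.7890625
"""

print(parse_metric_types(sample))  # prints {'node_cpu': 'counter'}
```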

Counter: A counter that only increases but does not decrease

Metrics of the Counter type work like a counter: the value only increases and never decreases (unless the system is reset). Common metrics such as http_requests_total and node_cpu are Counters. It is generally recommended to use _total as the suffix when naming a Counter metric.

Counter is a simple but powerful tool. For example, we can record how many times certain events occur in an application. Storing this data as a time series makes it easy to see how the rate of the event changes over time. PromQL's built-in aggregation operators and functions let users analyze the data further:

For example, use the rate() function to obtain the growth rate of HTTP requests:

rate(http_requests_total[5m])
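To make the idea behind rate() concrete, here is a rough sketch in plain Python of a per-second growth rate over a window of counter samples. It is a simplification, not the actual Prometheus implementation: the real rate() also extrapolates to the edges of the time window. The function name simple_rate and the sample values are assumptions made for illustration.

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter.

    samples: list of (timestamp_seconds, value) pairs, oldest first.
    A drop in value is treated as a counter reset.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # Counter restarted from zero: count everything since the reset.
            increase += curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Hypothetical samples taken every 60s; the counter resets before t=120.
samples = [(0, 100.0), (60, 160.0), (120, 10.0), (180, 70.0)]
print(simple_rate(samples))
```

Here the total increase is 60 + 10 + 60 = 130 requests over 180 seconds, even though the raw value dropped in the middle of the window.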

Query the 10 HTTP addresses with the highest request counts in the current system:

topk(10, http_requests_total)

Gauge: a value that can go up or down

Unlike Counter, Gauge metrics reflect the current state of the system, so their sample values can go up or down. Common metrics such as node_memory_MemFree (the amount of memory currently free on the host) and node_memory_MemAvailable (the amount of available memory) are Gauges.

Through Gauge indicators, users can directly view the current state of the system:

node_memory_MemFree

For monitoring indicators of the Gauge type, the change of the sample within a period of time can be obtained through the PromQL built-in function delta(). For example, to calculate the difference in CPU temperature over two hours:

delta(cpu_temp_celsius{host="zeus"}[2h])

You can also use deriv() to calculate the per-second rate of change of a Gauge using simple linear regression, or use predict_linear() to predict the trend of the data directly. For example, to predict how much free disk space will remain after 4 hours:

predict_linear(node_filesystem_free{job="node"}[1h], 4 * 3600)
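The regression behind this kind of prediction can be sketched in plain Python. This is an illustration of least-squares extrapolation under simplifying assumptions, not Prometheus's own code: it predicts relative to the newest sample rather than the query evaluation time, and the names linear_fit and predict as well as the sample values are hypothetical.

```python
def linear_fit(samples):
    """Least-squares slope and intercept for (timestamp, value) pairs.

    Requires at least two samples with distinct timestamps.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return slope, intercept


def predict(samples, seconds_ahead):
    """Extrapolate the fitted line past the newest sample."""
    slope, intercept = linear_fit(samples)
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


# Hypothetical free-space readings over one hour, shrinking by 1000 bytes/s.
free_bytes = [(0, 100_000_000), (1800, 98_200_000), (3600, 96_400_000)]

slope, _ = linear_fit(free_bytes)
print(slope)                          # per-second change, like deriv()
print(predict(free_bytes, 4 * 3600))  # estimate 4 hours after the last sample
```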

Use Histogram and Summary to analyze data distribution

In addition to the monitoring indicators of Counter and Gauge types, Prometheus also defines the indicator types of Histogram and Summary. Histogram and Summary are mainly used for statistics and analysis of the distribution of samples.

In most cases, people tend to look at the average of a quantitative metric, such as average CPU usage or average page response time. The problem with this approach is obvious. Take the average response time of system API calls as an example: if most API requests complete within about 100ms while a few individual requests take 5s, the average describes neither group well, and the slow responses that some web pages experience go unnoticed. This phenomenon is called the long-tail problem.

In order to distinguish between average slowness and long-tail slowness, the easiest way is to group requests according to the range of request delays. For example, count the number of requests with a delay between 0 and 10ms and the number of requests between 10 and 20ms. In this way, the cause of system slowness can be quickly analyzed. Both Histogram and Summary are designed to solve such problems. Through the monitoring indicators of the Histogram and Summary types, we can quickly understand the distribution of monitoring samples.
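The grouping described above can be illustrated with a short plain-Python sketch. The latency values and the helper name bucket_latencies are made up for the example; the point is only to show how an average hides what the per-range counts reveal.

```python
from collections import Counter


def bucket_latencies(latencies_ms, width_ms=10):
    """Count requests per fixed-width latency range, keyed by the lower bound."""
    counts = Counter((int(ms) // width_ms) * width_ms for ms in latencies_ms)
    return dict(sorted(counts.items()))


# Hypothetical data: 98 fast requests and 2 very slow ones.
latencies = [5] * 60 + [15] * 38 + [5000] * 2

mean = sum(latencies) / len(latencies)
print(mean)                         # prints 108.7
print(bucket_latencies(latencies))  # prints {0: 60, 10: 38, 5000: 2}
```

The mean of 108.7ms matches none of the actual requests, while the grouped counts show at a glance that almost all requests finish within 20ms and two of them sit far out in the tail.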

For example, the metric prometheus_tsdb_wal_fsync_duration_seconds is of type Summary. It records the duration of wal_fsync operations in Prometheus Server. Accessing the /metrics address of Prometheus Server returns the following sample data:

# HELP prometheus_tsdb_wal_fsync_duration_seconds Duration of WAL fsync.
# TYPE prometheus_tsdb_wal_fsync_duration_seconds summary
prometheus_tsdb_wal_fsync_duration_seconds{quantile="0.5"} 0.012352463
prometheus_tsdb_wal_fsync_duration_seconds{quantile="0.9"} 0.014458005
prometheus_tsdb_wal_fsync_duration_seconds{quantile="0.99"} 0.017316173
prometheus_tsdb_wal_fsync_duration_seconds_sum 2.888716127000002
prometheus_tsdb_wal_fsync_duration_seconds_count 216

From the sample above, we can see that the Prometheus Server has performed 216 wal_fsync operations so far, taking 2.888716127000002s in total. The median (quantile=0.5) duration is 0.012352463s, and the 90th percentile (quantile=0.9) is 0.014458005s.
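The _sum and _count samples also give the average duration directly. As a quick worked example in Python, using only the two values from the sample above (the variable names are chosen here for illustration):

```python
fsync_sum_seconds = 2.888716127000002  # value of the _sum sample
fsync_count = 216                      # value of the _count sample

# Average duration of one wal_fsync operation.
average_seconds = fsync_sum_seconds / fsync_count
print(f"{average_seconds * 1000:.2f} ms")  # prints 13.37 ms
```

The average of about 13.37ms lies between the median and the 90th percentile, which suggests that this particular distribution has no extreme tail.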

In the sample data returned by Prometheus Server itself, we can also find a metric of type Histogram, prometheus_tsdb_compaction_chunk_range, whose buckets are exposed as prometheus_tsdb_compaction_chunk_range_bucket samples:

# HELP prometheus_tsdb_compaction_chunk_range Final time range of chunks on their first compaction
# TYPE prometheus_tsdb_compaction_chunk_range histogram
prometheus_tsdb_compaction_chunk_range_bucket{le="100"} 0
prometheus_tsdb_compaction_chunk_range_bucket{le="400"} 0
prometheus_tsdb_compaction_chunk_range_bucket{le="1600"} 0
prometheus_tsdb_compaction_chunk_range_bucket{le="6400"} 0
prometheus_tsdb_compaction_chunk_range_bucket{le="25600"} 0
prometheus_tsdb_compaction_chunk_range_bucket{le="102400"} 0
prometheus_tsdb_compaction_chunk_range_bucket{le="409600"} 0
prometheus_tsdb_compaction_chunk_range_bucket{le="1.6384e+06"} 260
prometheus_tsdb_compaction_chunk_range_bucket{le="6.5536e+06"} 780
prometheus_tsdb_compaction_chunk_range_bucket{le="2.62144e+07"} 780
prometheus_tsdb_compaction_chunk_range_bucket{le="+Inf"} 780
prometheus_tsdb_compaction_chunk_range_sum 1.1540798e+09
prometheus_tsdb_compaction_chunk_range_count 780

Similar to Summary metrics, Histogram samples also report the total number of observations (with the _count suffix) and the sum of their values (with the _sum suffix). The difference is that a Histogram directly reports the number of observations in different intervals, and the intervals are defined by the label le. Each bucket is cumulative: it counts every observation less than or equal to its le value, which is why the le="+Inf" bucket equals _count.

For Histogram metrics, we can also compute quantiles with the histogram_quantile() function. The difference is that Histogram quantiles are computed on the server side by histogram_quantile(), whereas Summary quantiles are computed directly on the client side. Therefore, for quantile calculation, Summary performs better at query time through PromQL, while Histogram consumes more server resources. Conversely, Histogram consumes fewer resources on the client. Users should choose between the two according to their actual scenario.
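To show what a server-side quantile estimate from buckets involves, here is a simplified plain-Python sketch using linear interpolation inside the bucket that contains the requested rank. It follows the general idea of histogram_quantile() but is not its actual implementation and ignores several of its edge cases; the function name bucket_quantile is hypothetical, and the bucket values are copied from the sample above.

```python
import math


def bucket_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with the +Inf bucket.
    Assumes observations are spread evenly inside each bucket and that
    the first bucket starts at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies beyond the last finite bucket boundary.
                return lower_bound
            in_bucket = count - lower_count
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Cumulative buckets of prometheus_tsdb_compaction_chunk_range.
buckets = [
    (100, 0), (400, 0), (1600, 0), (6400, 0), (25600, 0),
    (102400, 0), (409600, 0), (1.6384e06, 260), (6.5536e06, 780),
    (2.62144e07, 780), (math.inf, 780),
]

print(bucket_quantile(0.5, buckets))  # prints 2867200.0
```

The median rank is 390 out of 780, which falls a quarter of the way into the bucket between 1.6384e+06 and 6.5536e+06. The result is only an estimate: its accuracy depends on how well the bucket boundaries match the real distribution.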

Posted by mk_silence on Sun, 01 Jan 2023 14:17:59 +0300