Building a cloud native, large-scale distributed monitoring system: monitoring very large container clusters with Kvass+Thanos


More than half a year has passed since the previous article, Thanos deployment and Practice, was published. As the technology has developed, this series is due another update. This article describes how to combine Kvass and Thanos to better monitor large-scale container clusters.

Isn't Thanos enough?

Some readers may ask: isn't Thanos meant to solve the distributed problem of Prometheus? Can't Thanos already handle large-scale Prometheus monitoring? Why is Kvass needed?
Thanos solves distributed storage and query for Prometheus, but it does not solve distributed collection. If there are too many collection tasks and too much data to collect, Prometheus still hits a bottleneck. The first article of this series, Optimization methods of Prometheus in large-scale scenarios, described some ways to mitigate this:

  1. Split the collection tasks by service across different Prometheus instances.
  2. Use the hashmod action provided by Prometheus to shard the collection tasks.
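
The second method can be illustrated with a small sketch. It assumes that hashmod takes an MD5 digest of the joined source label values and reduces it modulo the number of shards; the exact hash, the helper names and the sample addresses are assumptions made for illustration, not something taken from the original article.

```python
import hashlib


def hashmod(label_values, modulus, separator=";"):
    """Map a target to a shard index in the range [0, modulus).

    Assumption: the source label values are joined with the separator,
    hashed with MD5, and the last 8 bytes are read as a big-endian integer.
    """
    joined = separator.join(label_values)
    digest = hashlib.md5(joined.encode("utf-8")).digest()
    return int.from_bytes(digest[8:], "big") % modulus


def targets_for_shard(addresses, shard, total_shards):
    """Return the targets that a given Prometheus instance would keep."""
    return [a for a in addresses if hashmod([a], total_shards) == shard]


# Hypothetical target addresses, used only for this illustration.
addresses = [f"10.0.0.{i}:9091" for i in range(1, 13)]

for shard in range(3):
    print(shard, targets_for_shard(addresses, shard, 3))
```

Each instance keeps only the targets whose hash maps to its own shard index, so the number of shards is fixed in the configuration and balances the number of targets rather than the number of series.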

However, these optimizations still have drawbacks:

  1. Configuration is cumbersome: the collection configuration of each Prometheus instance has to be written separately.
  2. The data volume has to be estimated before configuring.
  3. Different Prometheus instances run different collection tasks, so the load is likely to be unbalanced. Without careful control, some instances may still be overloaded.
  4. Scaling Prometheus out or in requires manual adjustment; it cannot happen automatically.

Kvass was created to solve these problems, and it is the focus of this article.

What is Kvass?

Kvass is a lightweight horizontal scaling solution for Prometheus, open-sourced by Tencent Cloud. It separates service discovery from the collection process and uses a Sidecar to generate configuration files for Prometheus dynamically. As a result, different Prometheus instances collect different tasks without manual configuration, and collection tasks are load balanced so that no instance is overloaded. If the load is still too high, the deployment scales out automatically. Combined with Thanos's global view, a single configuration file is enough to build a monitoring system for very large clusters.

[Figure: architecture of Kvass+Thanos]

For more details on Kvass, see How to monitor Kubernetes cluster of 100000 containers with Prometheus, which covers its principles and results in detail.
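
The load-balancing idea described in this section can be illustrated with a toy scheduler. This is only a sketch under assumptions: the function name assign_targets, the greedy strategy and the sample series counts are hypothetical and are not taken from the Kvass code base.

```python
def assign_targets(series_by_target, max_series_per_shard):
    """Greedily place targets on shards without exceeding the series cap.

    Each target goes to the least-loaded shard that still has room.
    When no shard has room, a new shard is added (the "scale out" step).
    """
    shards = []  # each shard: {"targets": [...], "series": int}
    # Place the largest targets first so they are not left without room.
    ordered = sorted(series_by_target.items(), key=lambda kv: -kv[1])
    for target, series in ordered:
        candidates = [
            s for s in shards
            if s["series"] + series <= max_series_per_shard
        ]
        if candidates:
            shard = min(candidates, key=lambda s: s["series"])
        else:
            shard = {"targets": [], "series": 0}
            shards.append(shard)
        shard["targets"].append(target)
        shard["series"] += series
    return shards


# Hypothetical targets with uneven series counts.
sample = {"a": 12000, "b": 9000, "c": 8000, "d": 7000, "e": 3000}

for index, shard in enumerate(assign_targets(sample, 20000)):
    print(index, shard["targets"], shard["series"])
```

Unlike static hashing, the placement here depends on how many series each target actually produces, which is what allows the load to stay even and the number of shards to grow only when needed.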

Deployment practice

Deployment preparation

First clone the Kvass repo and enter the examples directory:

git clone
cd kvass/examples

Before deploying Kvass, we need a service that exposes metrics to collect. We provide a metrics generator that can produce a specified number of series. In this example we deploy six replicas of the metrics generator, each generating 10045 series, to the cluster with a single command:

kubectl create -f metrics.yaml

Deploy Kvass

Then let's deploy Kvass:

kubectl create -f kvass-rbac.yaml # RBAC configuration required by Kvass
kubectl create -f config.yaml # Prometheus configuration file
kubectl create -f coordinator.yaml # Kvass coordinator deployment configuration

The Prometheus configuration in config.yaml is set up to collect from the metrics generator just deployed:

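global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: custom
scrape_configs:
- job_name: 'metrics-test'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
  # Keep only pods labelled app.kubernetes.io/name=metrics
  - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
    regex: metrics
    action: keep
  # Scrape each pod on port 9091
  - source_labels: [__meta_kubernetes_pod_ip]
    action: replace
    regex: (.*)
    replacement: ${1}:9091
    target_label: __address__
  # Record the pod name in the "pod" label
  - source_labels:
    - __meta_kubernetes_pod_name
    target_label: pod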
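
To make the effect of the three relabel rules concrete, here is a simplified simulation of what they do to one discovered pod. It is a sketch rather than Prometheus's relabeling engine: the function name relabel_pod and the pod IP are hypothetical, and only the behaviour needed by this configuration is modelled.

```python
import re


def relabel_pod(meta):
    """Apply the three relabel rules to the __meta_* labels of one pod.

    Returns the resulting target labels, or None if the pod is dropped.
    """
    # Rule 1 (keep): the regex has to match the whole label value.
    name = meta.get("__meta_kubernetes_pod_label_app_kubernetes_io_name", "")
    if re.fullmatch(r"metrics", name) is None:
        return None

    labels = {}

    # Rule 2 (replace): build __address__ as "<pod ip>:9091".
    match = re.fullmatch(r"(.*)", meta.get("__meta_kubernetes_pod_ip", ""))
    if match is not None:
        labels["__address__"] = f"{match.group(1)}:9091"

    # Rule 3 (replace): copy the pod name into the "pod" label.
    labels["pod"] = meta.get("__meta_kubernetes_pod_name", "")
    return labels


# The pod IP below is made up for illustration.
generator_pod = {
    "__meta_kubernetes_pod_label_app_kubernetes_io_name": "metrics",
    "__meta_kubernetes_pod_ip": "10.0.0.12",
    "__meta_kubernetes_pod_name": "metrics-5876dccf65-5cncw",
}
other_pod = {
    "__meta_kubernetes_pod_label_app_kubernetes_io_name": "thanos-query",
    "__meta_kubernetes_pod_ip": "10.0.0.30",
    "__meta_kubernetes_pod_name": "thanos-query-69b9cb857-d2b45",
}

print(relabel_pod(generator_pod))
print(relabel_pod(other_pod))
```

Only pods carrying the app.kubernetes.io/name=metrics label survive the first rule, so the job scrapes the metrics generators and nothing else.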

In coordinator.yaml, the startup parameters of the Coordinator limit the number of head series of each shard to at most 30000.


Then deploy the Prometheus instance (including the Thanos Sidecar and the Kvass Sidecar). A single replica is enough to start with:

kubectl create -f prometheus-rep-0.yaml

If you need to store data in an object store, modify the configuration of the Thanos Sidecar as described in the previous article, Thanos deployment and Practice.

Deploy Thanos Query

To get the global data, we need to deploy a Thanos Query:

kubectl create -f thanos-query.yaml

With 6 monitoring targets of 10045 series each, there are 60270 series in total. Since each shard is limited to 30000 series, 3 shards are expected. Indeed, the Coordinator has changed the number of replicas of the StatefulSet to 3:

$ kubectl get pods
NAME                                READY   STATUS    RESTARTS   AGE
kvass-coordinator-c68f445f6-g9q5z   2/2     Running   0          64s
metrics-5876dccf65-5cncw            1/1     Running   0          75s
metrics-5876dccf65-6tw4b            1/1     Running   0          75s
metrics-5876dccf65-dzj2c            1/1     Running   0          75s
metrics-5876dccf65-gz9qd            1/1     Running   0          75s
metrics-5876dccf65-r25db            1/1     Running   0          75s
metrics-5876dccf65-tdqd7            1/1     Running   0          75s
prometheus-rep-0-0                  3/3     Running   0          54s
prometheus-rep-0-1                  3/3     Running   0          45s
prometheus-rep-0-2                  3/3     Running   0          45s
thanos-query-69b9cb857-d2b45        1/1     Running   0          49s
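
The three Prometheus replicas in this listing match a quick calculation. The sketch below assumes that a single target is never split across shards; the variable names are chosen only for this illustration.

```python
import math

TARGETS = 6                    # metrics generator replicas
SERIES_PER_TARGET = 10045      # series produced by each replica
MAX_SERIES_PER_SHARD = 30000   # limit set on the Coordinator

total_series = TARGETS * SERIES_PER_TARGET
# Whole targets that fit into one shard without exceeding the limit.
targets_per_shard = MAX_SERIES_PER_SHARD // SERIES_PER_TARGET
expected_shards = math.ceil(TARGETS / targets_per_shard)

print(total_series)       # 60270
print(targets_per_shard)  # 2
print(expected_shards)    # 3
```

Two targets (20090 series) fit within the limit while three (30135 series) do not, so six targets need three shards.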

Querying the global data through Thanos Query shows that the data is complete (metrics0 is the name of one of the metrics produced by the metrics generator).
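
The same check can be scripted against the Prometheus-compatible HTTP API that Thanos Query serves. The sketch below is an assumption-laden example: it reuses the in-cluster Thanos Query address given in this article, the helper names are hypothetical, and the query counts metrics0 series per pod using the pod label set in config.yaml.

```python
import json
import urllib.parse
import urllib.request

# In-cluster address of Thanos Query, as used elsewhere in this article.
THANOS_QUERY = "http://thanos-query.default.svc.cluster.local:9090"


def build_query_url(base_url, promql):
    """Build an instant-query URL for the /api/v1/query endpoint."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def fetch(url, timeout=10):
    """Fetch the raw response body; only works from inside the cluster."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def series_per_pod(response_body):
    """Turn an instant-query vector result into {pod name: series count}."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        item["metric"].get("pod", ""): int(float(item["value"][1]))
        for item in payload["data"]["result"]
    }
```

From a pod inside the cluster, series_per_pod(fetch(build_query_url(THANOS_QUERY, "count(metrics0) by (pod)"))) should list all six metrics generator pods if every shard is reachable through Thanos Query.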

To view the monitoring data in Grafana, add the Thanos Query address as a Prometheus data source: http://thanos-query.default.svc.cluster.local:9090 .


This article described how to combine Kvass and Thanos to monitor large-scale container clusters. If you use the Tencent Cloud container service, you can use the cloud native monitoring service under the operation and maintenance center directly; it is a product built on Kvass.


Tags: Kubernetes Distribution monitor and control Cloud Native

Posted by cretaceous on Tue, 03 May 2022 09:25:30 +0300