Kruise Rollout: Progressive Delivery for All Application Workloads

Author: Zhao Mingshan (Liheng)

Preface

OpenKruise [1] is an open-source cloud-native application automation suite from Alibaba Cloud, currently hosted as a Sandbox project under the Cloud Native Computing Foundation (CNCF). It distills years of Alibaba's container and cloud-native experience into a set of standard Kubernetes extension components that have been deployed at scale in Alibaba's internal production environment, combining design principles close to the upstream community standards with best practices adapted to large-scale Internet scenarios. Beyond its original focus areas such as workloads and sidecar management, Kruise is now exploring the field of progressive delivery.

What is progressive delivery?

The term "progressive delivery" originated in large, complex industrial projects: a complex project is broken down into phases, and delivery cost and time are reduced through continuous, small closed-loop iterations. With the popularity of Kubernetes and cloud-native concepts, and especially with the emergence of continuous-deployment pipelines, progressive delivery gained both the infrastructure and the practical means to be applied to Internet applications.

During product iteration, the concrete steps of progressive delivery can be attached to the pipeline, so that the entire delivery pipeline is treated as one product iteration and one progressive delivery cycle. In practice, progressive delivery is implemented with techniques such as A/B testing and canary (gray) releases. Taking Taobao's product recommendation as an example, every major feature release goes through a typical progressive delivery process, which improves both the stability and the efficiency of delivery:

Why Kruise Rollout

For application delivery, Kubernetes only provides the Deployment controller, plus the Ingress and Service abstractions for traffic; it has no standard, out-of-the-box definition of how to combine them into a progressive delivery scheme. Argo Rollouts and Flagger are the popular progressive delivery solutions in the community, but they differ from our assumptions in some capabilities and design ideas. First, they mainly support Deployment, not StatefulSet or DaemonSet, let alone custom operators. Second, they are not "non-intrusive" progressive release methods: Argo Rollouts cannot work with the native Kubernetes Deployment, and Flagger copies the Deployment created by the business, which changes its name and causes compatibility problems with GitOps tooling and self-built PaaS platforms.

In addition, "let a hundred flowers bloom" is a hallmark of the cloud-native ecosystem. The Alibaba Cloud container team is responsible for evolving the cloud-native architecture of the entire container platform, and also has strong requirements in the area of progressive application delivery. Therefore, drawing on the community solutions and considering Alibaba's internal scenarios, we set the following goals when designing Rollout:

  1. Non-intrusive: do not modify the native workload controllers or the user-defined Application YAML, keeping native resources clean and consistent.
  2. Extensible: support Kubernetes native workloads, custom workloads, and traffic routing via Nginx, Istio, and other controllers through an extensible design.
  3. Easy to use: work out of the box for users and combine easily with community GitOps tools or self-built PaaS platforms.

Kruise Rollout: Bypass-Style Progressive Delivery

Kruise Rollout [2] is Kruise's abstract model for progressive delivery. A complete Rollout definition covers canary, blue-green, and A/B testing releases for both application traffic and the actual deployed instances. The release process can be batched automatically and paused based on Prometheus metrics, works as a non-intrusive bypass, and is compatible with a variety of existing workloads (Deployment, CloneSet, and Advanced DaemonSet). The architecture is as follows:

Traffic scheduling (Canary, A/B Test, blue-green) and batch release

Canary and batch releases are the most commonly used release methods in progressive delivery practice. Taking the configuration below as an example:

  1. workloadRef: a bypass reference selecting the workload (Deployment, CloneSet, or DaemonSet) that the Rollout applies to.
  2. canary.steps: defines the whole Rollout as five batches. The first batch releases only one new-version Pod and routes 5% of the traffic to it, then requires manual confirmation before the release continues.
  3. The second batch releases 40% of the Pods as the new version and routes 40% of the traffic to them; after this batch completes, a 10-minute pause elapses and the subsequent batches are released automatically.
  4. trafficRoutings: defines the traffic controller, here Nginx Ingress. This field is designed to be extensible; besides Nginx, other traffic controllers such as Istio and ALB can be supported.
apiVersion: rollouts.kruise.io/v1alpha1
kind: Rollout
spec:
  strategy:
    objectRef:
      workloadRef:
        apiVersion: apps/v1
        # Deployment, CloneSet, Advanced DaemonSet, etc.
        kind: Deployment 
        name: echoserver
    canary:
      steps:
        # routing 5% traffics to the new version
      - weight: 5
        # Manual confirmation required before releasing the subsequent steps
        pause: {}
        # Optional: number of replicas released in this step; if not set, defaults to 'weight' (here 5%)
        replicas: 1
      - weight: 40
        # pause for 600s, then release the subsequent steps
        pause: {duration: 600}
      - weight: 60
        pause: {duration: 600}
      - weight: 80
        pause: {duration: 600}
        # The last batch does not need to be configured
      trafficRoutings:
        # echoserver service name
      - service: echoserver
        # nginx ingress
        type: nginx
        # echoserver ingress name
        ingress:
          name: echoserver

Automatic batching and suspension based on Metrics

During a Rollout, the business's Prometheus metrics can be analyzed automatically and, combined with the steps, used to decide whether the Rollout should continue or be paused. In the example below, after each batch is released, the service's HTTP status codes over the past five minutes are analyzed; if the proportion of non-5xx responses falls below 95%, the Rollout is paused.

apiVersion: rollouts.kruise.io/v1alpha1
kind: Rollout
spec:
  strategy:
    objectRef:
      ...
    canary:
      steps:
      - weight: 5
        ...
      # metrics analysis  
      analysis:
        templates:
        - templateName: success-rate
          startingStep: 2 # delay starting analysis run until setWeight: 40%
          args:
          - name: service-name
            value: guestbook-svc.default.svc.cluster.local

# metrics analysis template
apiVersion: rollouts.kruise.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 5m
    # NOTE: prometheus queries return results in the form of a vector.
    # So it is common to access the index 0 of the returned array to obtain the value
    successCondition: result[0] >= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.example.com:9090
        query: |
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]
          )) / 
          sum(irate(
            istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]
          ))

Canary Release in Practice

  1. Suppose a user has deployed the following echoserver service on Kubernetes and exposes it externally through an Nginx Ingress:
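The initial deployment might look like the following sketch; the replica count, container port, and Ingress host here are illustrative assumptions, not taken from the original article:

```yaml
# Assumed starting state: a plain Deployment, Service, and Nginx Ingress
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echoserver
spec:
  replicas: 5                      # assumed replica count
  selector:
    matchLabels:
      app: echoserver
  template:
    metadata:
      labels:
        app: echoserver
    spec:
      containers:
      - name: echoserver
        image: cilium/echoserver:1.10.2
        ports:
        - containerPort: 8080      # assumed container port
---
apiVersion: v1
kind: Service
metadata:
  name: echoserver
spec:
  selector:
    app: echoserver
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: echoserver
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  rules:
  - host: echoserver.example.com   # assumed host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: echoserver
            port:
              number: 80
```

Note that nothing here references Kruise Rollout: the workload stays a plain native Deployment, which is exactly what the bypass design requires.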

  2. Define a Kruise Rollout canary release (one new-version Pod and 5% of the traffic) and kubectl apply -f it to the Kubernetes cluster:
apiVersion: rollouts.kruise.io/v1alpha1
kind: Rollout
metadata:
  name: rollouts-demo
spec:
  objectRef:
    ...
  strategy:
    canary:
      steps:
      - weight: 5
        pause: {}
        replicas: 1
      trafficRoutings:
        ...
  3. Upgrade the echoserver image version (1.10.2 -> 1.10.3) and kubectl apply -f it to the Kubernetes cluster:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echoserver
...
spec:
  ...
  containers:
  - name: echoserver
    image: cilium/echoserver:1.10.3

After observing the change above, Kruise Rollout automatically starts the canary release process. As shown below, it generates the canary Deployment, Service, and Ingress automatically, and routes 5% of the traffic to the new-version Pods:
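A rough sketch of what the bypass canary traffic routing might look like for an Nginx Ingress. The resource name and structure here are illustrative assumptions; the canary annotations themselves are the standard ingress-nginx ones:

```yaml
# Illustrative canary Ingress generated alongside the stable one
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: echoserver-canary          # assumed generated name
  annotations:
    kubernetes.io/ingress.class: nginx
    # Standard ingress-nginx canary annotations: mark this Ingress as a
    # canary and send 5% of the traffic to its backend
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"
spec:
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: echoserver-canary   # canary Service selecting only new-version Pods
            port:
              number: 80
```

Because these resources live beside the user's originals rather than replacing them, they can be reclaimed cleanly once the release finishes or is rolled back.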

  4. After confirming that the new version works as expected, the developer can release all remaining Pods with the command kubectl-kruise rollout approve rollout/rollouts-demo -n default. Rollout precisely controls the subsequent batches; once the release completes, it reclaims all canary resources and restores everything to the state originally deployed by the user.

  5. If the new version turns out to be faulty during the canary phase, change the business image back to the previous version (1.10.2) and kubectl apply -f it to the Kubernetes cluster. Kruise Rollout observes this change and reclaims all canary resources, achieving a fast rollback:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echoserver
...
spec:
  ...
  containers:
  - name: echoserver
    image: cilium/echoserver:1.10.2

Summary

As more and more applications are deployed on Kubernetes, balancing rapid business iteration with application stability is a problem every platform builder must solve. Kruise Rollout is Kruise's exploration of this new field of progressive delivery. Kruise Rollout v0.1.0 has been officially released and integrated with the community OAM KubeVela project; vela users can quickly deploy and use the Rollout capability through addons. We also hope community users will join us to extend the field of application delivery further.

Related links

[1] OpenKruise:

https://github.com/openkruise/kruise

[2] Kruise Rollout:

https://github.com/openkruise/rollouts/blob/master/docs/getting_started/introduction.md


Tags: Kubernetes Alibaba Cloud Cloud Native Open Source CNCF

Posted by runfastrick on Tue, 26 Apr 2022 15:03:36 +0300