Author: Zhao Mingshan (Liheng)
Preface
OpenKruise [1] is an open-source cloud-native application automation suite from Alibaba Cloud, and a Sandbox project currently hosted under the Cloud Native Computing Foundation (CNCF). It distills Alibaba's years of experience with containers and cloud-native technology: a set of standard Kubernetes extension components used at large scale in Alibaba's internal production environment, along with technical concepts and best practices that stay close to upstream community standards while adapting to Internet-scale scenarios. Beyond its original focus areas such as workloads and sidecar management, Kruise is now exploring the field of progressive delivery.
What is progressive delivery?
The term "progressive delivery" originated in large, complex industrial projects. It breaks a complex project down into stages and reduces delivery cost and time through continuous, small closed-loop iterations. With the popularity of Kubernetes and cloud-native concepts, and especially with the emergence of continuous-deployment pipelines, progressive delivery gained the infrastructure and implementation methods it needs for Internet applications.
During product iteration, the concrete behaviors of progressive delivery can be attached to the pipeline, so the whole delivery pipeline can be viewed as one round of product iteration and one progressive-delivery cycle. In practice, progressive delivery is implemented with techniques such as A/B testing and canary (gray) releases. Taking Taobao's product recommendations as an example, every major feature release goes through a typical progressive-delivery process, improving both the stability and the efficiency of delivery:
Why Kruise Rollout
Kubernetes itself provides only the Deployment controller for application delivery and the Ingress and Service abstractions for traffic. It has no standard, out-of-the-box definition of how to combine these into a progressive-delivery scheme. Argo Rollouts and Flagger are popular progressive-delivery solutions in the community, but they differ from our assumptions in some capabilities and design ideas. First, they mainly target Deployment, not StatefulSet or DaemonSet, let alone custom operators. Second, they do not offer a "non-intrusive" style of progressive release: Argo Rollouts cannot work with native Kubernetes Deployments, and Flagger copies the Deployment created by the business, which changes its name and causes compatibility problems with GitOps tooling or self-built PaaS platforms.
In addition, diversity ("letting a hundred flowers bloom") is a hallmark of the cloud-native ecosystem. The Alibaba Cloud container team is responsible for evolving the cloud-native architecture of the entire container platform and has strong requirements in the field of progressive application delivery. Therefore, drawing on the community solutions and considering Alibaba's internal scenarios, we set the following goals when designing Kruise Rollout:
- Non-intrusive: do not modify the native workload controllers or the user's application YAML definitions, keeping native resources clean and consistent
- Extensible: support native Kubernetes workloads, custom workloads, and traffic-routing backends such as Nginx and Istio through an extensible design
- Easy to use: out of the box for users, and easy to combine with community GitOps tools or self-built PaaS platforms
Kruise Rollout: progressive delivery in bypass mode
Kruise Rollout [2] is Kruise's abstract model for progressive delivery. A complete Rollout definition covers canary, blue-green, and A/B-testing releases for both application traffic and the actual deployed instances; the release process can be batched and paused automatically based on Prometheus metrics; and it attaches in a bypass, non-intrusive way that is compatible with a variety of existing workloads (Deployment, CloneSet, DaemonSet). The architecture is as follows:
Traffic routing (canary, A/B test, blue-green) and batched release
Canary and batched releases are the most commonly used release methods in progressive-delivery practice. In the example below:
- workloadRef selects, in bypass mode, the workload that the Rollout applies to (Deployment, CloneSet, DaemonSet).
- canary.steps divides the whole rollout into five batches. The first batch contains only one new-version Pod and routes 5% of the traffic to it, and requires manual confirmation before the release continues.
- The second batch releases 40% of the Pods at the new version and routes 40% of the traffic to them. After this batch completes, the rollout sleeps for 10 minutes and then automatically releases the subsequent batches.
- trafficRoutings defines the traffic controller as Nginx. This part is designed to be extensible: besides Nginx, it can also support Istio, ALB, and other traffic controllers.
```yaml
apiVersion: rollouts.kruise.io/v1alpha1
kind: Rollout
spec:
  strategy:
    objectRef:
      workloadRef:
        # Deployment, CloneSet, Advanced DaemonSet, etc.
        apiVersion: apps/v1
        kind: Deployment
        name: echoserver
    canary:
      steps:
        # routing 5% of traffic to the new version
        - weight: 5
          # manual confirmation before releasing the later steps
          pause: {}
          # optional, the number of new-version replicas released in the
          # first step. If not set, defaults to 'weight' (5% here).
          replicas: 1
        - weight: 40
          # sleep 600s, then release the later steps
          pause: {duration: 600}
        - weight: 60
          pause: {duration: 600}
        - weight: 80
          pause: {duration: 600}
        # the last batch does not need to be configured
      trafficRoutings:
        # echoserver service name
        - service: echoserver
          # nginx ingress
          type: nginx
          # echoserver ingress name
          ingress:
            name: echoserver
```
Automatic batching and pausing based on metrics
During a rollout, Kruise Rollout can automatically analyze the business's Prometheus metrics and, combined with the steps, decide whether the rollout should continue or be paused. In the example below, after each batch is published, the HTTP status codes of the service over the past five minutes are analyzed; if the proportion of successful responses falls below the configured threshold, the rollout process is paused.
```yaml
apiVersion: rollouts.kruise.io/v1alpha1
kind: Rollout
spec:
  strategy:
    objectRef:
      ...
    canary:
      steps:
        - weight: 5
      ...
      # metrics analysis
      analysis:
        templates:
          - templateName: success-rate
        # delay starting the analysis run until setWeight: 40%
        startingStep: 2
        args:
          - name: service-name
            value: guestbook-svc.default.svc.cluster.local
---
# metrics analysis template
apiVersion: rollouts.kruise.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 5m
      # NOTE: prometheus queries return results in the form of a vector.
      # So it is common to access index 0 of the returned array to obtain the value.
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.example.com:9090
          query: |
            sum(irate(
              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}",response_code!~"5.*"}[5m]
            )) / sum(irate(
              istio_requests_total{reporter="source",destination_service=~"{{args.service-name}}"}[5m]
            ))
```
Canary release practice
- Suppose the user has deployed an echoserver service on Kubernetes and exposes it externally through an Nginx Ingress:
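The article's original baseline manifests are not reproduced here. A minimal sketch of what such a deployment might look like, assuming the `echoserver` Deployment/Service/Ingress names from the Rollout example above; the replica count, port, and host are illustrative assumptions:

```yaml
# Hypothetical baseline manifests for the echoserver service
# (values are illustrative assumptions, not the article's original YAML).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echoserver
spec:
  replicas: 5
  selector:
    matchLabels:
      app: echoserver
  template:
    metadata:
      labels:
        app: echoserver
    spec:
      containers:
        - name: echoserver
          image: cilium/echoserver:1.10.2
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: echoserver
spec:
  selector:
    app: echoserver
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: echoserver
spec:
  ingressClassName: nginx
  rules:
    - host: echoserver.example.com   # assumed host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: echoserver
                port:
                  number: 80
```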
- Define a Kruise Rollout canary release (one new-version Pod and 5% of traffic), and kubectl apply -f it to the Kubernetes cluster:
```yaml
apiVersion: rollouts.kruise.io/v1alpha1
kind: Rollout
metadata:
  name: rollouts-demo
spec:
  objectRef:
    ...
  strategy:
    canary:
      steps:
        - weight: 5
          pause: {}
          replicas: 1
      trafficRoutings:
        ...
```
- Upgrade the echoserver image version (1.10.2 -> 1.10.3), and kubectl apply -f it to the Kubernetes cluster:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echoserver
  ...
spec:
  ...
      containers:
        - name: echoserver
          image: cilium/echoserver:1.10.3
```
Kruise Rollout watches for this change and automatically starts the canary release process. As shown below, a canary Deployment, Service, and Ingress are generated automatically, and 5% of traffic is routed to the new-version Pods:
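For an Nginx ingress controller, canary traffic splitting is typically expressed through the standard Nginx ingress canary annotations. A sketch of what the generated canary Ingress might look like; the resource name, host, and Service name are assumptions for illustration, and the actual objects Kruise Rollout generates may differ:

```yaml
# Hypothetical auto-generated canary Ingress (illustrative only).
# The standard Nginx ingress canary annotations route 5% of traffic
# to the Service backing the new-version Pods.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: echoserver-canary      # assumed generated name
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"
spec:
  ingressClassName: nginx
  rules:
    - host: echoserver.example.com   # assumed host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: echoserver-canary   # assumed canary Service name
                port:
                  number: 80
```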
- After some time, once the business's R&D engineers have confirmed that the new version is healthy, the remaining Pods can be released with the command `kubectl kruise rollout approve rollout/rollouts-demo -n default`. Rollout precisely controls the subsequent process; after the release completes, it reclaims all canary resources and restores the workload to the state the user deployed.
- If a problem is found in the new version during the canary phase, you can roll the business image back to the previous version (1.10.2) and kubectl apply -f it to the Kubernetes cluster. Kruise Rollout watches this change and reclaims all canary resources, achieving a fast rollback.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echoserver
  ...
spec:
  ...
      containers:
        - name: echoserver
          image: cilium/echoserver:1.10.2
```
Summary
As more and more applications are deployed on Kubernetes, balancing rapid business iteration against application stability is a problem every platform builder must solve. With Kruise Rollout, Kruise is extending its exploration into the field of progressive delivery. Kruise Rollout v0.1.0 has been officially released and integrated with the community OAM KubeVela project; vela users can quickly deploy and use Rollout capabilities through addons. We also hope community users will join us to extend the field of application delivery even further.
- Github: https://github.com/openkruise/rollouts
- Official: https://openkruise.io/
- Slack: Channel in Kubernetes Slack
- DingTalk group:
Related links
[1] OpenKruise:
https://github.com/openkruise/kruise
[2] Kruise Rollout:
https://github.com/openkruise/rollouts/blob/master/docs/getting_started/introduction.md