The Ultimate Matryoshka 2.0 | Observability Practices of a Cloud Native PaaS Platform

One Monday morning, Xiao Tao made a cup of hot coffee as usual ☕. Just as he was about to open the project and start a new day's work, Xiao Wen next door suddenly shouted, "Look, the user support group is blowing up..."

User A: "The Git service has a problem, and code submission is failing!"

User B: "Please take a look, the pipeline execution is reporting an error..."

User C: "Our system goes live today, and now the deployment page won't open. This is going to be a disaster!"

User D:

Xiao Tao had to put down his coffee, switch his screen to the bastion host, log in to the server, and run through a familiar sequence of operations. "Oh, it turns out the code released last weekend missed a parameter validation, which caused a panic," Xiao Tao said to Xiao Wen, pointing at a container log on the screen.

Ten minutes later, Xiao Wen updated the online system with the patched installation package, and the users' problems were resolved.

Although the fault was fixed, Xiao Tao fell into deep thought: "Why didn't we notice the system anomaly before our users did? And do we really have to log in to the bastion host and dig through container logs, or is there a faster way to locate the cause of online faults?"

At this point, Xiao L, sitting across the desk, said, "We keep telling users that we help them achieve observability for their systems. It's time for Erda itself to be observed."

Xiao Tao: "So what should we do...?" Read on to find out ~

Usually, we would build independent distributed tracing, monitoring, and logging systems to help development teams diagnose and observe their microservice systems. At the same time, Erda itself already provides full-featured service observation capabilities, and some tracing systems in the community (such as Apache SkyWalking and Jaeger) provide observability for themselves. This gave us another idea: use the platform's own capabilities to observe itself.

In the end, we chose to implement Erda's observability on the Erda platform itself. The considerations behind this choice are as follows:

  • The platform already provides service observation capabilities; introducing an external platform would mean redundant construction and increase the platform's resource cost
  • The development team troubleshoots faults and performance problems on its own platform, and eating our own dog food also helps improve the product
  • For the core components of the observability system itself, such as Kafka and the data computing components, we cover the blind spot with the SRE team's inspection tools, which trigger alarm messages when a problem occurs

The Erda microservice observation platform provides observation and diagnosis tools from different perspectives, such as APM, user experience monitoring, distributed tracing, and log analysis. Following the principle of making the best use of everything, we also process each kind of observation data generated by Erda accordingly. Read on for the implementation details.

OpenTelemetry data access

In a previous article, we introduced how to access Jaeger traces on Erda. At first, we considered using the Jaeger Go SDK as our tracing implementation, but OpenTracing, the API that Jaeger primarily implements, is no longer maintained, so we turned our attention to the new generation observability standard, OpenTelemetry.

OpenTelemetry is a CNCF observability project formed by the merger of OpenTracing and OpenCensus. It aims to provide a standardized solution for observability, covering the data model, collection, processing, and export of observation data, independent of any third-party vendor.

As shown in the figure below, to ingest OpenTelemetry trace data on the Erda observability platform, we need to implement a receiver for the OTLP protocol in the gateway component, and a new span analysis component on the data consumption side that parses OTLP data into the observation data model of Erda APM.

The gateway component is a lightweight Golang implementation. Its core logic is to parse the OTLP proto data and add authentication and rate limiting for tenant data.

Key code reference: receivers/opentelemetry
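As a rough illustration of the gateway's authenticate → rate-limit → parse flow (this is not Erda's actual code: the handler shape, URL path, token scheme, and limiter are invented for the sketch; a real receiver would unmarshal the OTLP protobuf and use a time-based limiter):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync"
)

// tenantLimiter is a deliberately simple per-tenant counter; a real
// gateway would use a time-based token bucket instead.
type tenantLimiter struct {
	mu     sync.Mutex
	counts map[string]int
	max    int
}

func newTenantLimiter(max int) *tenantLimiter {
	return &tenantLimiter{counts: map[string]int{}, max: max}
}

// Allow reports whether the tenant is still under its request budget.
func (l *tenantLimiter) Allow(tenant string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.counts[tenant] >= l.max {
		return false
	}
	l.counts[tenant]++
	return true
}

// otlpTracesHandler sketches the gateway logic: authenticate the tenant,
// apply a rate limit, then hand the raw OTLP payload to a parser.
func otlpTracesHandler(limiter *tenantLimiter, auth func(key string) (string, bool)) http.HandlerFunc {
	return func(rw http.ResponseWriter, r *http.Request) {
		tenant, ok := auth(r.Header.Get("Authorization"))
		if !ok {
			http.Error(rw, "unauthorized", http.StatusUnauthorized)
			return
		}
		if !limiter.Allow(tenant) {
			http.Error(rw, "too many requests", http.StatusTooManyRequests)
			return
		}
		payload, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(rw, "bad request", http.StatusBadRequest)
			return
		}
		// A real receiver would proto.Unmarshal the payload into an
		// ExportTraceServiceRequest and forward the spans downstream.
		fmt.Fprintf(rw, "accepted %d bytes for tenant %s", len(payload), tenant)
	}
}

func main() {
	auth := func(key string) (string, bool) {
		if key == "token-a" { // hypothetical tenant token
			return "tenant-a", true
		}
		return "", false
	}
	h := otlpTracesHandler(newTenantLimiter(1), auth)

	for _, key := range []string{"", "token-a", "token-a"} {
		req := httptest.NewRequest(http.MethodPost, "/api/otlp/v1/traces", nil)
		req.Header.Set("Authorization", key)
		rec := httptest.NewRecorder()
		h(rec, req)
		fmt.Println(rec.Code)
	}
	// prints 401, then 200, then 429
}
```

The ordering matters: authentication comes first so that unauthenticated traffic never consumes a tenant's rate budget.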

The span_analysis component is implemented on Flink. It aggregates and analyzes OpenTelemetry span data through dynamic-gap time windows to produce the following metrics:

  • service_node, describing the nodes and instances of a service
  • service_call_*, describing the call metrics of services and interfaces, including HTTP, RPC, DB, and Cache
  • service_call_*_error, describing erroneous calls of services, including HTTP, RPC, DB, and Cache
  • service_relation, describing the call relationships between services

Meanwhile, span_analysis also converts OTLP spans into Erda's standard span model, and writes the metrics above together with the converted span data to Kafka, where they are consumed and stored by the Erda observability platform's existing data consumption components.

Key code reference: analyzer/tracing

With the above in place, Erda can now ingest and process OpenTelemetry trace data.

Next, let's look at how Erda's own services connect to OpenTelemetry.

Non-intrusive call interception in Golang

As a cloud native PaaS platform, Erda naturally uses Golang, the most popular language in the cloud native field, for its development. In the early days of Erda, however, we did not preset any tracing instrumentation in the platform logic, so even though OpenTelemetry provides an out-of-the-box Go SDK, manually adding spans throughout the core logic would have required a huge investment.

From our previous experience with Java and .NET Core projects, AOP is typically used to implement non-business concerns such as tracing instrumentation and performance metrics. Although Golang does not provide a Java Agent-like mechanism for modifying code logic at runtime, we were inspired by the monkey project. After thoroughly comparing and testing monkey, pinpoint-apm/go-aop-agent, and gohook, we chose gohook as the basis of Erda's AOP approach, and finally provided an implementation of automatic tracing instrumentation in erda-infra.

For the principle behind monkey, please refer to monkey-patching-in-go.

Taking the automatic tracing of an HTTP server as an example, our core implementation is as follows:

// mirror the unexported net/http.serverHandler type so we can refer to it
//go:linkname serverHandler net/http.serverHandler
type serverHandler struct {
  srv *http.Server
}

// serveHTTP is linked to the unexported net/http.serverHandler.ServeHTTP,
// the entry point we intercept
//go:linkname serveHTTP net/http.serverHandler.ServeHTTP
//go:noinline
func serveHTTP(s *serverHandler, rw http.ResponseWriter, req *http.Request)

// originalServeHTTP is an empty placeholder; gohook rewrites it into a
// trampoline that calls the original ServeHTTP
//go:noinline
func originalServeHTTP(s *serverHandler, rw http.ResponseWriter, req *http.Request) {}

// tracedServerHandler wraps the original handler with otelhttp tracing and
// names each span "METHOD /path" with the query string stripped
var tracedServerHandler = otelhttp.NewHandler(http.HandlerFunc(func(rw http.ResponseWriter, r *http.Request) {
  injectcontext.SetContext(r.Context())
  defer injectcontext.ClearContext()
  s := getServerHandler(r.Context())
  originalServeHTTP(s, rw, r)
}), "", otelhttp.WithSpanNameFormatter(func(operation string, r *http.Request) string {
  u := *r.URL
  u.RawQuery = ""
  u.ForceQuery = false
  return r.Method + " " + u.String()
}))

type _serverHandlerKey int8

const serverHandlerKey _serverHandlerKey = 0

func withServerHandler(ctx context.Context, s *serverHandler) context.Context {
  return context.WithValue(ctx, serverHandlerKey, s)
}

func getServerHandler(ctx context.Context) *serverHandler {
  return ctx.Value(serverHandlerKey).(*serverHandler)
}

// wrappedHTTPHandler replaces serveHTTP: it stashes the serverHandler in
// the request context and delegates to the traced handler
//go:noinline
func wrappedHTTPHandler(s *serverHandler, rw http.ResponseWriter, req *http.Request) {
  req = req.WithContext(withServerHandler(req.Context(), s))
  tracedServerHandler.ServeHTTP(rw, req)
}

func init() {
  // swap net/http.serverHandler.ServeHTTP for the traced wrapper, keeping
  // the original reachable via originalServeHTTP
  hook.Hook(serveHTTP, wrappedHTTPHandler, originalServeHTTP)
}

After solving automatic instrumentation in Golang, another thorny problem we encountered was that in asynchronous scenarios, the trace context cannot be passed to the next goroutine across the context switch. Drawing on the two asynchronous programming models of Java's Future and C#'s Task, we implemented an asynchronous API that automatically passes the trace context:

  future1 := parallel.Go(ctx, func(ctx context.Context) (interface{}, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://www.baidu.com/api_1", nil)
    if err != nil {
      return nil, err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
      return nil, err
    }
    defer resp.Body.Close()
    byts, err := ioutil.ReadAll(resp.Body)
    if err != nil {
      return nil, err
    }
    return string(byts), nil
  })

  future2 := parallel.Go(ctx, func(ctx context.Context) (interface{}, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://www.baidu.com/api_2", nil)
    if err != nil {
      return nil, err
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
      return nil, err
    }
    defer resp.Body.Close()
    byts, err := ioutil.ReadAll(resp.Body)
    if err != nil {
      return nil, err
    }
    return string(byts), nil
  }, parallel.WithTimeout(10*time.Second))

  body1, err := future1.Get()
  if err != nil {
    return nil, err
  }

  body2, err := future2.Get()
  if err != nil {
    return nil, err
  }

  return &pb.HelloResponse{
    Success: true,
    Data:    body1.(string) + body2.(string),
  }, nil

Closing thoughts

After using OpenTelemetry to feed the trace data generated by Erda platform calls into Erda's own APM, the first benefit is an intuitive view of Erda's runtime topology:

From this topology, we can see many problems in Erda's own architecture design, such as circular dependencies between services and outlier services. Guided by our own observation data, we can also gradually optimize Erda's call architecture with each version iteration.

The SRE team next door can also learn of platform anomalies immediately, through the alarm messages that Erda APM generates from automatically analyzed call exceptions:

Finally, based on the observation data, our development team can easily spot slow calls on the platform and analyze faults and performance bottlenecks from the traces:

Xiao L: "Besides the above, we can use similar ideas to connect the platform's logs and page load performance to Erda's observability platform as well."

Xiao Tao suddenly saw the light: "I see, so that's how the matryoshka gets observed! From now on, I can drink my coffee and get on with my own work in peace 😄."

We are committed to solving the problems and needs that community users report from real production environments. If you have any questions or suggestions, please feel free to reach out.

Tags: Cloud Native paas Open Source

Posted by latvaustin on Tue, 19 Apr 2022 10:41:36 +0300