OpenTelemetry Note
Reference
https://opentelemetry.io/docs/what-is-opentelemetry/
What is OpenTelemetry?
A brief explanation of what OpenTelemetry is and isn’t.
OpenTelemetry is:
- An observability framework and toolkit designed to facilitate the
  - Generation
  - Export
  - Collection

  of telemetry data such as traces, metrics, and logs.
- Open source, as well as vendor- and tool-agnostic, meaning that it can be used with a broad variety of observability backends, including open source tools like Jaeger and Prometheus, as well as commercial offerings. OpenTelemetry is not an observability backend itself.
A major goal of OpenTelemetry is to enable easy instrumentation of your applications and systems, regardless of the programming language, infrastructure, and runtime environments used.
The backend (storage) and the frontend (visualization) of telemetry data are intentionally left to other tools.
Concept
Observability primer
https://opentelemetry.io/docs/concepts/observability-primer/
Understanding distributed tracing
Distributed tracing lets you observe requests as they propagate through complex, distributed systems. Distributed tracing improves the visibility of your application or system’s health and lets you debug behavior that is difficult to reproduce locally. It is essential for distributed systems, which commonly have nondeterministic problems or are too complicated to reproduce locally.
To understand distributed tracing, you need to understand the role of each of its components: logs, spans, and traces.
Logs
A log is a timestamped message emitted by services or other components. Unlike traces, they aren’t necessarily associated with any particular user request or transaction. You can find logs almost everywhere in software. Logs have been heavily relied on in the past by both developers and operators to help them understand system behavior.
Sample log:
I, [2021-02-23T13:26:23.505892 #22473] INFO -- : [6459ffe1-ea53-4044-aaa3-bf902868f730] Started GET "/" for ::1 at 2021-02-23 13:26:23 -0800
Logs aren’t enough for tracking code execution, as they usually lack contextual information, such as where they were called from.
They become far more useful when they are included as part of a span, or when they are correlated with a trace and a span.
For more on logs and how they pertain to OpenTelemetry, see Logs.
Spans
A span represents a unit of work or operation. Spans track specific operations that a request makes, painting a picture of what happened during the time in which that operation was executed.
A span contains name, time-related data, structured log messages, and other metadata (that is, Attributes) to provide information about the operation it tracks.
Span attributes
Span attributes are metadata attached to a span.
The following table contains examples of span attributes:
| Key | Value |
|---|---|
| http.request.method | "GET" |
| network.protocol.version | "1.1" |
| url.path | "/webshop/articles/4" |
| url.query | "?s=1" |
| server.address | "example.com" |
| server.port | 8080 |
| url.scheme | "https" |
| http.route | "/webshop/articles/:article_id" |
| http.response.status_code | 200 |
| client.address | "192.0.2.4" |
| client.socket.address | "192.0.2.5" (the client goes through a proxy) |
| user_agent.original | "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0" |
For more on spans and how they relate to OpenTelemetry, see Spans.
Distributed traces
A distributed trace, more commonly known as a trace, records the paths taken by requests (made by an application or end-user) as they propagate through multi-service architectures, like microservice and serverless applications.
A trace is made of one or more spans. The first span represents the root span. Each root span represents a request from start to finish. The spans underneath the parent provide a more in-depth context of what occurs during a request (or what steps make up a request).
Without tracing, finding the root cause of performance problems in a distributed system can be challenging. Tracing makes debugging and understanding distributed systems less daunting by breaking down what happens within a request as it flows through a distributed system.
Context
Context is an object that contains the information for the sending and receiving service, or execution unit, to correlate one signal with another.
For example, if service A calls service B, then a span from service A whose ID is in context will be used as the parent span for the next span created in service B. The trace ID that is in context will be used for the next span created in service B as well, which means that the span is part of the same trace as the span from service A.
Propagation
Propagation is the mechanism that moves context between services and processes. It serializes or deserializes the context object and provides the relevant information to be propagated from one service to another.
Propagation is usually handled by instrumentation libraries and is transparent to the user. In the event that you need to manually propagate context, you can use the Propagators API.
OpenTelemetry maintains several official propagators. The default propagator uses the headers specified by the W3C TraceContext specification.
Signals
OpenTelemetry currently supports:
Traces
The path of a request through your application.
Traces give us the big picture of what happens when a request is made to an application. Whether your application is a monolith with a single database or a sophisticated mesh of services, traces are essential to understanding the full “path” a request takes in your application.
Let’s explore this with three units of work, represented as Spans:
Note
The following JSON examples do not represent a specific format, and especially not OTLP/JSON, which is more verbose.
hello span:
{
  "name": "hello",
  "context": {
    "trace_id": "5b8aa5a2d2c872e8321cf37308d69df2",
    "span_id": "051581bf3cb55c13"
  },
  "parent_id": null,
  "start_time": "2022-04-29T18:52:58.114201Z",
  "end_time": "2022-04-29T18:52:58.114687Z",
  "attributes": {
    "http.route": "some_route1"
  },
  "events": [
    {
      "name": "Guten Tag!",
      "timestamp": "2022-04-29T18:52:58.114561Z",
      "attributes": {
        "event_attributes": 1
      }
    }
  ]
}
This is the root span, denoting the beginning and end of the entire operation. Note that it has a trace_id field indicating the trace, but has no parent_id. That’s how you know it’s the root span.
hello-greetings span:
{
  "name": "hello-greetings",
  "context": {
    "trace_id": "5b8aa5a2d2c872e8321cf37308d69df2",
    "span_id": "5fb397be34d26b51"
  },
  "parent_id": "051581bf3cb55c13",
  "start_time": "2022-04-29T18:52:58.114304Z",
  "end_time": "2022-04-29T18:52:58.114561Z",
  "attributes": {
    "http.route": "some_route2"
  },
  "events": [
    {
      "name": "hey there!",
      "timestamp": "2022-04-29T18:52:58.114561Z",
      "attributes": {
        "event_attributes": 1
      }
    },
    {
      "name": "bye now!",
      "timestamp": "2022-04-29T18:52:58.114585Z",
      "attributes": {
        "event_attributes": 1
      }
    }
  ]
}
This span encapsulates specific tasks, like saying greetings, and its parent is the hello span. Note that it shares the same trace_id as the root span, indicating it’s a part of the same trace. Additionally, it has a parent_id that matches the span_id of the hello span.
hello-salutations span:
{
  "name": "hello-salutations",
  "context": {
    "trace_id": "5b8aa5a2d2c872e8321cf37308d69df2",
    "span_id": "93564f51e1abe1c2"
  },
  "parent_id": "051581bf3cb55c13",
  "start_time": "2022-04-29T18:52:58.114492Z",
  "end_time": "2022-04-29T18:52:58.114631Z",
  "attributes": {
    "http.route": "some_route3"
  },
  "events": [
    {
      "name": "hey there!",
      "timestamp": "2022-04-29T18:52:58.114561Z",
      "attributes": {
        "event_attributes": 1
      }
    }
  ]
}
This span represents the third operation in this trace and, like the previous one, it’s a child of the hello span. That also makes it a sibling of the hello-greetings span.
These three blocks of JSON all share the same trace_id, and the parent_id field represents a hierarchy. That makes it a Trace!
Another thing you’ll note is that each Span looks like a structured log. That’s because it kind of is! One way to think of Traces is that they’re a collection of structured logs with context, correlation, hierarchy, and more baked in. However, these “structured logs” can come from different processes, services, VMs, data centers, and so on. This is what allows tracing to represent an end-to-end view of any system.
To understand how tracing in OpenTelemetry works, let’s look at a list of components that will play a part in instrumenting our code.
Tracer Provider
A Tracer Provider (sometimes called TracerProvider) is a factory for Tracers. In most applications, a Tracer Provider is initialized once and its lifecycle matches the application’s lifecycle. Tracer Provider initialization also includes Resource and Exporter initialization. It is typically the first step in tracing with OpenTelemetry. In some language SDKs, a global Tracer Provider is already initialized for you.
Tracer
A Tracer creates spans containing more information about what is happening for a given operation, such as a request in a service. Tracers are created from Tracer Providers.
Trace Exporters
Trace Exporters send traces to a consumer. This consumer can be standard output for debugging and development-time, the OpenTelemetry Collector, or any open source or vendor backend of your choice.
Context Propagation
Context Propagation is the core concept that enables Distributed Tracing. With Context Propagation, Spans can be correlated with each other and assembled into a trace, regardless of where Spans are generated. To learn more about this topic, see the concept page on Context Propagation.
Spans
A span represents a unit of work or operation. Spans are the building blocks of Traces. In OpenTelemetry, they include the following information:
- Name
- Parent span ID (empty for root spans)
- Start and End Timestamps
- Span Context
- Attributes
- Span Events
- Span Links
- Span Status
Sample span:
{
  "name": "/v1/sys/health",
  "context": {
    "trace_id": "7bba9f33312b3dbb8b2c2c62bb7abe2d",
    "span_id": "086e83747d0e381e"
  },
  "parent_id": "",
  "start_time": "2021-10-22 16:04:01.209458162 +0000 UTC",
  "end_time": "2021-10-22 16:04:01.209514132 +0000 UTC",
  "status_code": "STATUS_CODE_OK",
  "status_message": "",
  "attributes": {
    "net.transport": "IP.TCP",
    "net.peer.ip": "172.17.0.1",
    "net.peer.port": "51820",
    "net.host.ip": "10.177.2.152",
    "net.host.port": "26040",
    "http.method": "GET",
    "http.target": "/v1/sys/health",
    "http.server_name": "mortar-gateway",
    "http.route": "/v1/sys/health",
    "http.user_agent": "Consul Health Check",
    "http.scheme": "http",
    "http.host": "10.177.2.152:26040",
    "http.flavor": "1.1"
  },
  "events": [
    {
      "name": "",
      "message": "OK",
      "timestamp": "2021-10-22 16:04:01.209512872 +0000 UTC"
    }
  ]
}
Spans can be nested, as is implied by the presence of a parent span ID: child spans represent sub-operations. This allows spans to more accurately capture the work done in an application.
Span Context
Span context is an immutable object on every span that contains the following:
- The Trace ID representing the trace that the span is a part of
- The span’s Span ID
- Trace Flags, a binary encoding containing information about the trace
- Trace State, a list of key-value pairs that can carry vendor-specific trace information
Attributes
Attributes are key-value pairs that contain metadata that you can use to annotate a Span to carry information about the operation it is tracking.
For example, if a span tracks an operation that adds an item to a user’s shopping cart in an eCommerce system, you can capture the user’s ID, the ID of the item to add to the cart, and the cart ID.
You can add attributes to spans during or after span creation. Prefer adding attributes at span creation to make the attributes available to SDK sampling. If you have to add a value after span creation, update the span with the value.
Attributes have the following rules that each language SDK implements:
- Keys must be non-null string values
- Values must be a non-null string, boolean, floating point value, integer, or an array of these values
Additionally, there are Semantic Attributes, which are known naming conventions for metadata that is typically present in common operations. It’s helpful to use semantic attribute naming wherever possible so that common kinds of metadata are standardized across systems.
Metrics
A measurement captured at runtime.
A metric is a measurement of a service captured at runtime. The moment of capturing a measurement is known as a metric event, which consists not only of the measurement itself but also of the time at which it was captured and associated metadata.
Application and request metrics are important indicators of availability and performance. Custom metrics can provide insights into how availability indicators impact user experience or the business. Collected data can be used to alert of an outage or trigger scheduling decisions to scale up a deployment automatically upon high demand.
To understand how metrics in OpenTelemetry work, let’s look at a list of components that will play a part in instrumenting our code.
Metric Instruments
In OpenTelemetry measurements are captured by metric instruments. A metric instrument is defined by:
- Name
- Kind
- Unit (optional)
- Description (optional)
The name, unit, and description are chosen by the developer or defined via semantic conventions for common ones like request and process metrics.
The instrument kind is one of the following:
- Counter: A value that accumulates over time – you can think of this like an odometer on a car; it only ever goes up.
- Asynchronous Counter: Same as the Counter, but is collected once for each export. Could be used if you don’t have access to the continuous increments, but only to the aggregated value.
- UpDownCounter: A value that accumulates over time, but can also go down again. An example could be a queue length, it will increase and decrease with the number of work items in the queue.
- Asynchronous UpDownCounter: Same as the UpDownCounter, but is collected once for each export. Could be used if you don’t have access to the continuous changes, but only to the aggregated value (e.g., current queue size).
- Gauge: Measures a current value at the time it is read. An example would be the fuel gauge in a vehicle. Gauges are synchronous.
- Asynchronous Gauge: Same as the Gauge, but is collected once for each export. Could be used if you don’t have access to the continuous changes, but only to the aggregated value.
- Histogram: A client-side aggregation of values, such as request latencies. A histogram is a good choice if you are interested in value statistics. For example: How many requests take fewer than 1s?
For more on synchronous and asynchronous instruments, and which kind is best suited for your use case, see Supplementary Guidelines.
Aggregation
In addition to the metric instruments, the concept of aggregations is an important one to understand. An aggregation is a technique whereby a large number of measurements are combined into either exact or estimated statistics about metric events that took place during a time window. The OTLP protocol transports such aggregated metrics. The OpenTelemetry API provides a default aggregation for each instrument which can be overridden using the Views. The OpenTelemetry project aims to provide default aggregations that are supported by visualizers and telemetry backends.
Unlike request tracing, which is intended to capture request lifecycles and provide context to the individual pieces of a request, metrics are intended to provide statistical information in aggregate. Some examples of use cases for metrics include:
- Reporting the total number of bytes read by a service, per protocol type.
- Reporting the total number of bytes read and the bytes per request.
- Reporting the duration of a system call.
- Reporting request sizes in order to determine a trend.
- Reporting CPU or memory usage of a process.
- Reporting average balance values from an account.
- Reporting the number of active requests currently being handled.
Views
A view provides SDK users with the flexibility to customize the metrics output by the SDK. You can customize which metric instruments are to be processed or ignored. You can also customize aggregation and what attributes you want to report on metrics.
Logs
A recording of an event.
A log is a timestamped text record, either structured (recommended) or unstructured, with optional metadata. Of all telemetry signals, logs have the biggest legacy. Most programming languages have built-in logging capabilities or well-known, widely used logging libraries.
OpenTelemetry logs
OpenTelemetry does not define a bespoke API or SDK to create logs. Instead, OpenTelemetry logs are the existing logs you already have from a logging framework or infrastructure component. OpenTelemetry SDKs and autoinstrumentation utilize several components to automatically correlate logs with traces.
OpenTelemetry’s support for logs is designed to be fully compatible with what you already have, providing capabilities to wrap those logs with additional context and a common toolkit to parse and manipulate logs into a common format across many different sources.
OpenTelemetry logs in the OpenTelemetry Collector
The OpenTelemetry Collector provides several tools to work with logs:
- Several receivers which parse logs from specific, known sources of log data.
- The filelog receiver, which reads logs from any file and provides features to parse them from different formats or use a regular expression.
- Processors like the transform processor, which lets you parse nested data, flatten nested structures, add/remove/update values, and more.
- Exporters that let you emit log data in a non-OpenTelemetry format.
The first step in adopting OpenTelemetry frequently involves deploying a Collector as a general-purpose logging agent.
OpenTelemetry logs for applications
In applications, OpenTelemetry logs are created with any logging library or built-in logging capabilities. When you add autoinstrumentation or activate an SDK, OpenTelemetry will automatically correlate your existing logs with any active trace and span, wrapping the log body with their IDs. In other words, OpenTelemetry automatically correlates your logs and traces.
Baggage
Contextual information that is passed between signals.
In OpenTelemetry, Baggage is contextual information that resides next to context. Baggage is a key-value store, which means it lets you propagate any data you like alongside context.
Baggage means you can pass data across services and processes, making it available to add to traces, metrics, or logs in those services.
Example
Baggage is often used in tracing to propagate additional data across services.
For example, imagine you have a clientId at the start of a request, but you’d like for that ID to be available on all spans in a trace, some metrics in another service, and some logs along the way. Because the trace may span multiple services, you need some way to propagate that data without copying the clientId across many places in your codebase.
By using Context Propagation to pass baggage across these services, the clientId is available to add to any additional spans, metrics, or logs. Additionally, instrumentations automatically propagate baggage for you.
What should OTel Baggage be used for?
Baggage is best used to include information typically available only at the start of a request further downstream. This can include things like Account Identification, User IDs, Product IDs, and origin IPs, for example.
Propagating this information using baggage allows for deeper analysis of telemetry in a backend. For example, if you include information like a User ID on a span that tracks a database call, you can much more easily answer questions like “which users are experiencing the slowest database calls?” You can also log information about a downstream operation and include that same User ID in the log data.
Instrumentation
How OpenTelemetry facilitates instrumentation
For a system to be observable, it must be instrumented: that is, code from the system’s components must emit signals, such as traces, metrics, and logs.
Using OpenTelemetry, you can instrument your code in two primary ways:
Code-based solutions allow you to get deeper insight and rich telemetry from your application itself. They let you use the OpenTelemetry API to generate telemetry from your application, which acts as an essential complement to the telemetry generated by zero-code solutions.
Zero-code solutions are great for getting started, or when you can’t modify the application you need to get telemetry out of. They provide rich telemetry from libraries you use and/or the environment your application runs in. Another way to think of it is that they provide information about what’s happening at the edges of your application.
You can use both solutions simultaneously.
Author: Administrator
Link: http://localhost:8090//archives/opentelemetrynote
Copyright: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit the source when reposting!