Observability in Distributed Systems

Last modified: June 11, 2021

1. Introduction

In this tutorial, we’ll discuss observability and why it plays an important role in a distributed system. We’ll cover the types of data that constitute observability. This will help us understand the challenges in collecting, storing, and analyzing telemetry data from a distributed system.

Finally, we’ll cover some of the industry standards and popular tools in the area of observability.

2. What Is Observability?

Let’s cut to the chase and get the formal definition out to begin with! Observability is the ability to measure the internal state of a system only by its external outputs.

For a distributed system like microservices, these external outputs are basically known as telemetry data. It includes information like the resource consumption of a machine, the logs generated by the applications running on a machine, and several others.

2.1. Types of Telemetry Data

We can organize telemetry data into three categories that we refer to as the three pillars of observability: logs, metrics, and traces. Let’s understand them in more detail.

Logs are lines of text that an application generates at discrete points during the execution of the code. Normally these are structured and often generated at different levels of severity. These are quite easy to generate but often carry performance costs. Moreover, we may require additional tools like Logstash to collect, store, and analyze logs efficiently.
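
To make the idea of a structured, leveled log line concrete, here is a minimal sketch in plain Java. The field names (timestamp, level, service, message) and the class name StructuredLogDemo are assumptions made for this illustration, not a standard, and the escaping is deliberately minimal; in practice a logging library would produce such lines.

```java
final class StructuredLogDemo {

    // Escapes backslashes and double quotes so the value stays valid inside a JSON string.
    // Minimal on purpose: control characters such as newlines are not handled here.
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds one structured log line; the field names are illustrative assumptions.
    static String logLine(String timestamp, String level, String service, String message) {
        return "{\"timestamp\":\"" + escape(timestamp) + "\","
            + "\"level\":\"" + escape(level) + "\","
            + "\"service\":\"" + escape(service) + "\","
            + "\"message\":\"" + escape(message) + "\"}";
    }

    public static void main(String[] args) {
        // Each severity level produces the same shape of record, which keeps logs easy to filter.
        System.out.println(logLine("2021-06-11T10:15:30Z", "INFO", "math-service", "request received"));
        System.out.println(logLine("2021-06-11T10:15:31Z", "WARN", "math-service", "addition-service responded slowly"));
    }
}
```

Because every line has the same fields, a collector can filter by level or service without parsing free text.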

Simply put, metrics are values represented as counts or measures that we calculate or aggregate over a time period. These values express some data about a system like a virtual machine — for instance, the memory consumption of a virtual machine every second. These can come from various sources like the host, application, and cloud platform.
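
As a rough illustration of aggregating raw values over a time period, the sketch below summarizes a handful of per-second memory samples into a count, minimum, maximum, and average. The sample values and the class name MemoryMetricDemo are made up for this example.

```java
import java.util.DoubleSummaryStatistics;

final class MemoryMetricDemo {

    // Aggregates raw per-second samples (in MB) into one summary for the whole window.
    static DoubleSummaryStatistics aggregate(double[] samplesMb) {
        DoubleSummaryStatistics stats = new DoubleSummaryStatistics();
        for (double sample : samplesMb) {
            stats.accept(sample);
        }
        return stats;
    }

    public static void main(String[] args) {
        // Hypothetical memory consumption of a virtual machine, sampled once per second.
        double[] samplesMb = {512.0, 530.0, 498.0, 540.0};
        DoubleSummaryStatistics stats = aggregate(samplesMb);
        System.out.println("count=" + stats.getCount()
            + " min=" + stats.getMin()
            + " max=" + stats.getMax()
            + " avg=" + stats.getAverage());
    }
}
```

Storing only the summary instead of every sample is what makes metrics cheap to keep for long periods.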

Traces are important for distributed systems where a single request can flow through multiple applications. A trace is a representation of distributed events as the request flows through a distributed system. Traces can be quite helpful in locating problems like bottlenecks, defects, or other issues in a distributed system.

2.2. Benefits of Observability

To begin with, we need to understand why we need observability in a system at all. Most of us have probably faced the challenges of troubleshooting difficult-to-understand behaviors on a production system. It’s not difficult to understand that our options to disrupt a production environment are limited. This pretty much leaves us to analyze the data that the system generates.

Observability is invaluable for investigating situations where a system starts to deviate from its intended state. It’s also quite useful to prevent these situations altogether! A careful setup of alerts based on the observable data generated by a system can help us take remedial actions before the system fails altogether. Moreover, such data gives us important analytical insights to tune the system for a better experience.

The need for observability, while important for any system, is especially significant for a distributed system. Moreover, our systems can span public and private clouds as well as on-premises environments. Further, they keep changing in scale and complexity over time. This can often present problems that were never anticipated before. A highly observable system can tremendously help us handle such situations.

3. Observability vs. Monitoring

We often hear about monitoring in relation to observability in the practice of DevOps. So, what is the difference between these terms? Well, they both have similar functions and enable us to maintain the system’s reliability. But they have a subtle difference and, in fact, a relationship between them. We can only effectively monitor a system if it’s observable!

Monitoring basically refers to the practice of watching a system’s state through a predefined set of metrics and logs. This inherently means that we are watching for a known set of failures. However, in a distributed system, there are a lot of dynamic changes that keep happening. This results in problems that we were never looking for. Hence, our monitoring system can just miss them.

Observability, on the other hand, helps us understand the internal state of a system. This can enable us to ask arbitrary questions about the system’s behavior. For instance, we can ask complex questions like how did each service handle the request in case of problems. Over time, it can aid in building knowledge about the dynamic behavior of the system.

To understand why this is so, we need to understand the concept of cardinality. Cardinality refers to the number of unique items in a set. For instance, the set of users’ social security numbers will have a higher cardinality than gender. To answer arbitrary questions about a system’s behavior, we need high cardinality data. However, monitoring typically only deals with low cardinality data.
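
A small sketch may help make cardinality tangible: it simply counts the distinct values seen for a field. The field names and sample values are invented for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

final class CardinalityDemo {

    // Cardinality is the number of distinct values observed for a field.
    static int cardinality(List<String> values) {
        return new HashSet<>(values).size();
    }

    public static void main(String[] args) {
        // Made-up telemetry fields: status codes repeat, request ids do not.
        List<String> httpStatus = Arrays.asList("200", "200", "500", "200", "404");
        List<String> requestIds = Arrays.asList("r-1", "r-2", "r-3", "r-4", "r-5");
        System.out.println("status cardinality    = " + cardinality(httpStatus));
        System.out.println("requestId cardinality = " + cardinality(requestIds));
    }
}
```

A field like the request id grows with traffic, which is why high-cardinality data is more expensive to store but far more useful for answering questions about a single request.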

4. Observability in a Distributed System

As we’ve seen earlier, observability is especially useful for a complex distributed system. But, what exactly makes a distributed system complex, and what are the challenges of observability in such a system? It’s important to understand this question to appreciate the ecosystem of tools and platforms grown around this subject in the last few years.

In a distributed system, there are a lot of moving components that change the system landscape dynamically. Moreover, dynamic scalability means that there will be an uncertain number of instances running for a service at any point in time. This makes the job of collecting, curating, and storing the system output like logs and metrics difficult.

Further, it’s not sufficient just to understand what is happening within applications of a system. For instance, the problem may be in the network layer or the load balancer. Then there are databases, messaging platforms, and the list goes on. It’s important that all these components are observable at all times. We must be able to gather and centralize meaningful data from all parts of the system.

Moreover, since several components are working together, either synchronously or asynchronously, it’s not easy to pinpoint the source of an anomaly. For instance, it’s difficult to say which service in the system is causing a bottleneck that surfaces as performance degradation. Traces, as we’ve seen before, are quite useful in investigating such problems.

5. Evolution of Observability

Observability has its roots in control theory, a branch of applied mathematics that deals with the use of feedback to influence the behavior of a system to achieve the desired goal. We can apply this principle in several industries, from industrial plants to aircraft operations. For software systems, this has become popular since some social networking sites like Twitter started to work at massive scales.

Until recent years, most software systems were monolithic, making it fairly easy to reason about them during incidents. Monitoring was quite effective in indicating typical failure scenarios. Further, it was intuitive to debug the code for identifying problems. But, with the advent of microservices architecture and cloud computing, this quickly became a difficult task.

As this evolution continued, software systems were no longer static — they had numerous components that shifted dynamically. This resulted in problems that were never anticipated before. This gave rise to many tools under the umbrella of Application Performance Management (APM), like AppDynamics and Dynatrace. These tools promised a better way to understand the application code and system behavior.

Although these tools have come a long way in evolution, they were fairly metrics-based back then. This prevented them from providing the kind of perspective we required about a system’s state. However, they were a major step forward. Today, we’ve got a combination of tools to address the three pillars of observability. Of course, the underlying components also need to be observable!

6. Hands-on with Observability

Now that we’ve covered enough theory about observability, let’s see how we can get this into practice. We’ll use a simple microservices-based distributed system where we’ll develop the individual services with Spring Boot in Java. These services will communicate with each other synchronously using the REST APIs.

Let’s have a look at our system services:

This is a fairly simple distributed system where the math-service uses APIs provided by addition-service, multiplication-service, and others. Further, the math-service exposes APIs to calculate various formulae. We’ll skip the details of creating these microservices as it’s very straightforward.

The emphasis of this exercise is to recognize the most common standards and popular tools available today in the context of observability. Our target architecture for this system with observability will look something like the diagram below:

Many of these are also in various stages of recognition with the Cloud Native Computing Foundation (CNCF), an organization that promotes the advancement of container technologies. We’ll see how to use some of these in our distributed system.

7. Traces with OpenTracing

We’ve seen how traces can provide invaluable insights to understand how a single request propagates through a distributed system. OpenTracing is an incubating project under the CNCF. It provides vendor-neutral APIs and instrumentation for distributed tracing. This helps us to add instrumentation to our code that isn’t specific to any vendor.

The list of tracers available that conform to OpenTracing is growing fast. One of the most popular tracers is Jaeger, which is also a graduated project under the CNCF.

Let’s see how we can use Jaeger with OpenTracing in our application:

We’ll go through the details later. Just to note, there are several other options like LightStep, Instana, SkyWalking, and Datadog. We can easily switch between these tracers without changing the way we’ve added instrumentation in our code.

7.1. Concepts and Terminology

A trace in OpenTracing is composed of spans. A span is an individual unit of work done in a distributed system. Basically, a trace can be seen as a directed acyclic graph (DAG) of spans. We call the edges between spans references. Every component in a distributed system adds a span to the trace. Spans contain references to other spans, and this helps a trace to recreate the life of a request.
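
The graph structure can be sketched without any tracing library. In the plain-Java model below, each span record holds a reference to its parent span, and following those references from any span leads back to the root of the request. The class names, span ids, and operation names are hypothetical and chosen to match our sample services; this is not the OpenTracing API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TraceGraphDemo {

    // A simplified span: an id, the operation it represents, and a reference to its parent.
    static final class SpanRecord {
        final String spanId;
        final String parentSpanId; // null for the root span
        final String operation;

        SpanRecord(String spanId, String parentSpanId, String operation) {
            this.spanId = spanId;
            this.parentSpanId = parentSpanId;
            this.operation = operation;
        }
    }

    // A made-up trace: math-service calls two downstream services.
    static List<SpanRecord> sampleTrace() {
        return Arrays.asList(
            new SpanRecord("1", null, "math-service:/formula"),
            new SpanRecord("2", "1", "addition-service:/add"),
            new SpanRecord("3", "1", "multiplication-service:/multiply"));
    }

    // Follows parent references from the given span up to the root span.
    static List<String> pathToRoot(List<SpanRecord> trace, String spanId) {
        Map<String, SpanRecord> byId = new HashMap<>();
        for (SpanRecord span : trace) {
            byId.put(span.spanId, span);
        }
        List<String> path = new ArrayList<>();
        SpanRecord current = byId.get(spanId);
        while (current != null) {
            path.add(current.operation);
            current = current.parentSpanId == null ? null : byId.get(current.parentSpanId);
        }
        return path;
    }

    public static void main(String[] args) {
        System.out.println(pathToRoot(sampleTrace(), "3"));
    }
}
```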

We can visualize the causal relationship between the spans in a trace with a time-axis or a graph:

Here, we can see the two types of references that OpenTracing defines, “ChildOf” and “FollowsFrom”. These establish the relationship between the child and the parent spans.

The OpenTracing specification defines the state that a span captures:

  • An operation name
  • The start time-stamp and the finish time-stamp
  • A set of key-value span tags
  • A set of key-value span logs
  • The SpanContext

Tags are user-defined annotations on a span that we can use to query and filter the trace data. Span tags apply to the whole span. Similarly, logs allow a span to capture logging messages and other debugging or informational output from the application. Span logs can apply to a specific moment or event within a span.

Finally, the SpanContext is what ties the spans together. It carries data across the process boundaries. Let’s have a quick look at a typical SpanContext:

As we can see, it primarily comprises:

  • The implementation-dependent state like spanId and traceId
  • Any baggage items, which are key-value pairs that cross the process boundary
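
To illustrate how such a context might cross a process boundary, the following plain-Java sketch copies the trace id, span id, and baggage into a map of request headers on the calling side and rebuilds the context on the receiving side. The header names and the class SpanContextDemo are assumptions made for this sketch; real tracers define their own wire formats.

```java
import java.util.HashMap;
import java.util.Map;

final class SpanContextDemo {

    // Hypothetical header names, chosen only for this sketch.
    static final String TRACE_ID = "x-trace-id";
    static final String SPAN_ID = "x-span-id";
    static final String BAGGAGE_PREFIX = "x-baggage-";

    final String traceId;
    final String spanId;
    final Map<String, String> baggage;

    SpanContextDemo(String traceId, String spanId, Map<String, String> baggage) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.baggage = new HashMap<>(baggage);
    }

    // Caller side: copy the context into the outgoing request headers.
    Map<String, String> inject() {
        Map<String, String> headers = new HashMap<>();
        headers.put(TRACE_ID, traceId);
        headers.put(SPAN_ID, spanId);
        for (Map.Entry<String, String> item : baggage.entrySet()) {
            headers.put(BAGGAGE_PREFIX + item.getKey(), item.getValue());
        }
        return headers;
    }

    // Callee side: rebuild the context from the incoming request headers.
    static SpanContextDemo extract(Map<String, String> headers) {
        Map<String, String> baggage = new HashMap<>();
        for (Map.Entry<String, String> header : headers.entrySet()) {
            if (header.getKey().startsWith(BAGGAGE_PREFIX)) {
                baggage.put(header.getKey().substring(BAGGAGE_PREFIX.length()), header.getValue());
            }
        }
        return new SpanContextDemo(headers.get(TRACE_ID), headers.get(SPAN_ID), baggage);
    }
}
```

The receiving service would start its own span with the extracted trace id, which is what lets the tracer stitch spans from different processes into one trace.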

7.2. Setup and Instrumentation

We’ll begin with installing Jaeger, the OpenTracing compatible tracer that we’ll be using. Although it has several components, we can install them all with a simple Docker command:

docker run -d -p 5775:5775/udp -p 6831:6831/udp -p 16686:16686 jaegertracing/all-in-one:latest

Next, we need to import the necessary dependencies in our application. For a Maven-based application, this is as simple as adding the dependency:

<dependency>
    <groupId>io.opentracing.contrib</groupId>
    <artifactId>opentracing-spring-jaeger-web-starter</artifactId>
    <version>3.3.1</version>
</dependency>

For a Spring Boot-based application, we can leverage this library contributed by third parties. This includes all the necessary dependencies and provides necessary default configurations to instrument web request/response and send traces to Jaeger.

On the application side, we need to create a Tracer:

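import io.jaegertracing.Configuration;
import io.opentracing.Tracer;
import org.springframework.context.annotation.Bean;

// Declared inside a Spring configuration class
@Bean
public Tracer getTracer() {
    // A "const" sampler with param 1 samples every trace
    Configuration.SamplerConfiguration samplerConfig = Configuration
      .SamplerConfiguration.fromEnv()
      .withType("const").withParam(1);
    // Also log each span as it is reported
    Configuration.ReporterConfiguration reporterConfig = Configuration
      .ReporterConfiguration.fromEnv()
      .withLogSpans(true);
    // "math-service" is the service name under which spans are reported
    Configuration config = new Configuration("math-service")
      .withSampler(samplerConfig)
      .withReporter(reporterConfig);
    return config.getTracer();
}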

This is sufficient to generate the spans for the services a request passes through. We can also generate child spans within our service if necessary:

Span span = tracer.buildSpan("my-span").start();
// Some code for which the span needs to be reported
span.finish();

This is pretty simple and intuitive but extremely powerful when we analyze them for a complex distributed system.

7.3. Trace Analysis

Jaeger comes with a user interface accessible by default at port 16686. It provides a simple way to query, filter, and analyze the trace data with visualization. Let’s see a sample trace for our distributed system:

As we can see, this is a visualization for one particular trace identified by its traceId. It clearly shows all the spans within this trace with details like which service it belongs to and the time it took to complete. This can help us understand where the problem may be in the case of atypical behaviors.

8. Metrics with OpenCensus

OpenCensus provides libraries for various languages that allow us to collect metrics and distributed traces from our application. It originated at Google but since then has been developed as an open-source project by a growing community. The benefit of OpenCensus is that it can send the data to any backend for analysis. This allows us to abstract our instrumentation code rather than having it coupled to specific backends.

Although OpenCensus can support both traces and metrics, we’ll only use it for metrics in our sample application. There are several backends that we can use. One of the most popular metrics tools is Prometheus, an open-source monitoring solution that is also a graduated project under the CNCF. Let’s see how Prometheus with OpenCensus integrates with our application:

Although Prometheus comes with a user interface, we can use a visualization tool like Grafana that integrates well with Prometheus.

8.1. Concepts and Terminology

In OpenCensus, a measure represents a metric type to be recorded. For example, the size of the request payload can be one measure to collect. A measurement is a data point produced by recording a quantity against a measure. For example, 80 kB can be a measurement for the request payload size measure. All measures are identified by name, description, and unit.

To analyze the stats, we need to aggregate the data with views. Views are basically the coupling of an aggregation applied to a measure and, optionally, tags. OpenCensus supports aggregation methods like count, distribution, sum, and last value. A view is composed of name, description, measure, tag keys, and aggregation. Multiple views can use the same measure with different aggregations.
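
To show what a distribution aggregation does conceptually, the sketch below counts measurements into buckets delimited by a list of boundaries. It is a plain-Java illustration and not the OpenCensus API. The bucket convention used here (one bucket below the first boundary, one at or above the last, and each boundary belonging to the bucket on its right) and the sample latencies are assumptions made for this example.

```java
import java.util.Arrays;

final class DistributionDemo {

    // Returns boundaries.length + 1 counters:
    // counts[0] holds values below boundaries[0],
    // counts[i] holds values in [boundaries[i - 1], boundaries[i]),
    // and the last counter holds values at or above the last boundary.
    static long[] bucketCounts(double[] boundaries, double[] values) {
        long[] counts = new long[boundaries.length + 1];
        for (double value : values) {
            int bucket = 0;
            while (bucket < boundaries.length && value >= boundaries[bucket]) {
                bucket++;
            }
            counts[bucket]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] boundariesMs = {0.0, 25.0, 100.0, 200.0, 400.0, 800.0, 10000.0};
        // Made-up latency measurements in milliseconds.
        double[] latenciesMs = {17.0, 30.0, 90.0, 250.0, 1200.0};
        System.out.println(Arrays.toString(bucketCounts(boundariesMs, latenciesMs)));
    }
}
```

A count aggregation over the same measurements would keep only the number 5, while a sum would keep 1587.0; the measure is the same and only the view differs.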

Tags are key-value pairs of data associated with recorded measurements to provide contextual information and to distinguish and group metrics during analysis. When we aggregate measurements to create metrics, we can use tags as labels to break down the metrics. Tags can also be propagated as request headers in a distributed system.

Finally, an exporter can send the metrics to any backend that is capable of consuming them. The exporter can change depending upon the backend without any impact on the client code. This makes OpenCensus vendor-neutral in terms of metrics collection. There are quite a few exporters available in multiple languages for most of the popular backends like Prometheus.

8.2. Setup and Instrumentation

Since we’ll be using Prometheus as our backend, we should begin by installing it. This is quick and simple using the official Docker image. Prometheus collects metrics from monitored targets by scraping metrics endpoints on these targets. So, we need to provide the details in the Prometheus configuration YAML file, prometheus.yml:

scrape_configs:
  - job_name: 'spring_opencensus'
    scrape_interval: 10s
    static_configs:
      - targets: ['localhost:8887', 'localhost:8888', 'localhost:8889']

This is a fundamental configuration that tells Prometheus which targets to scrape metrics from. Now, we can start Prometheus with a simple command:

docker run -d -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" prom/prometheus

For defining custom metrics, we begin by defining a measure:

MeasureDouble M_LATENCY_MS = MeasureDouble
  .create("math-service/latency", "The latency in milliseconds", "ms");

Next, we need to record a measurement for the measure we’ve just defined:

StatsRecorder STATS_RECORDER = Stats.getStatsRecorder();
STATS_RECORDER.newMeasureMap()
  .put(M_LATENCY_MS, 17.0)
  .record();

Then, we need to define an aggregation and view for our measure that will enable us to export this as a metric:

// Bucket boundaries for the latency distribution, in milliseconds
Aggregation latencyDistribution = Distribution.create(BucketBoundaries.create(
  Arrays.asList(0.0, 25.0, 100.0, 200.0, 400.0, 800.0, 10000.0)));
// KEY_METHOD is a TagKey assumed to be defined elsewhere in the application
View view = View.create(
  Name.create("math-service/latency"),
  "The distribution of the latencies",
  M_LATENCY_MS,
  latencyDistribution,
  Collections.singletonList(KEY_METHOD));
ViewManager manager = Stats.getViewManager();
manager.registerView(view);

Finally, for exporting views to Prometheus, we need to create and register the collector and run an HTTP server as a daemon:

PrometheusStatsCollector.createAndRegister();
HTTPServer server = new HTTPServer("localhost", 8887, true);

This is a straightforward example that illustrates how we can record latency as a measure from our application and export that as a view to Prometheus for storage and analysis.

8.3. Metrics Analysis

OpenCensus provides in-process web pages called zPages that display collected data from the process they’re attached to. Further, Prometheus offers the expression browser that allows us to enter any expression and see its result. However, tools like Grafana provide a more elegant and efficient visualization.

Installing Grafana using the official Docker image is quite simple:

docker run -d --name=grafana -p 3000:3000 grafana/grafana

Grafana supports querying Prometheus — we simply need to add Prometheus as a data source in Grafana. Then, we can create a graph with a regular Prometheus query expression for metrics:

There are several graph settings that we can use to tune our graph. Additionally, there are several pre-built Grafana dashboards available for Prometheus that we may find useful.

9. Logs with Elastic Stack

Logs can provide invaluable insights into the way an application reacted to an event. Unfortunately, in a distributed system, this is split across multiple components. Hence, it becomes important to collect logs from all the components and store them in one single place for effective analysis. Moreover, we require an intuitive user interface to efficiently query, filter and reference the logs.

Elastic Stack is basically a log management platform that, until recently, was a collection of three products – Elasticsearch, Logstash, and Kibana (ELK).

However, since then, Beats have been added to this stack for efficient data collection.

Let’s see how we can use these products in our application:

As we can see, in Java, we can generate logs with a simple abstraction like SLF4J and a logger like Logback. We will skip these details here.

The Elastic Stack products are open-source and maintained by Elastic. Together, these provide a compelling platform for log analysis in a distributed system.

9.1. Concepts and Terminology

As we’ve seen, Elastic Stack is a collection of multiple products. The earliest of these products was Elasticsearch, which is a distributed, RESTful, JSON-based search engine. It’s quite popular due to its flexibility and scalability. This is the product that led to the foundation of Elastic. It’s fundamentally based on the Apache Lucene search engine.

Elasticsearch stores data as documents, which are the base unit of storage. These are simple JSON objects. We can use types to subdivide similar kinds of data inside an index. Indices are the logical partitions of documents. Typically, we can split indices horizontally into shards for scalability. Further, we can also replicate shards for fault tolerance.
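
The idea of splitting an index horizontally can be sketched as a routing function that maps each document to one primary shard. In its simplest documented form, Elasticsearch routes with a hash of the routing value modulo the number of primary shards; the sketch below uses Java’s String.hashCode purely as a stand-in for the real hash function, and the class name and document ids are hypothetical.

```java
final class ShardRoutingDemo {

    // Maps a routing value (by default the document id) to one of the primary shards.
    // String.hashCode is only a stand-in here; it is not the hash Elasticsearch uses.
    static int shardFor(String routing, int primaryShards) {
        return Math.floorMod(routing.hashCode(), primaryShards);
    }

    public static void main(String[] args) {
        int primaryShards = 3;
        for (String docId : new String[] {"log-1001", "log-1002", "log-1003", "log-1004"}) {
            System.out.println(docId + " -> shard " + shardFor(docId, primaryShards));
        }
    }
}
```

Because the same routing value always lands on the same shard, a lookup by id only needs to contact one shard, while the number of primary shards has to stay fixed for the mapping to remain valid.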

Logstash is a log aggregator that collects data from various input sources. It also executes different transformations and enhancements and ships the data to an output destination. Since Logstash has a larger footprint, we have Beats, which are lightweight data shippers that we can install as agents on our servers. Finally, Kibana is a visualization layer that works on top of Elasticsearch.

Together, these products offer a complete suite to perform aggregation, processing, storage, and analysis of the log data:

With these products, we can create a production-grade data pipeline for our log data. However, it’s quite possible, and in some cases also necessary, to extend this architecture to handle large volumes of log data. We can place a buffer like Kafka in front of Logstash to prevent upstream components from overwhelming it. Elastic Stack is quite flexible in that regard.

9.2. Setup and Instrumentation

The Elastic Stack, as we’ve seen earlier, comprises several products. We can, of course, install them independently. However, that is time-consuming. Fortunately, Elastic provides official Docker images to make this easy.

Starting a single-node Elasticsearch cluster is as simple as running a Docker command:

docker run -p 9200:9200 -p 9300:9300 \
  -e "discovery.type=single-node" \
  docker.elastic.co/elasticsearch/elasticsearch:7.13.0

Similarly, installing Kibana and connecting it to the Elasticsearch cluster is quite easy:

docker run -p 5601:5601 \
  -e "ELASTICSEARCH_HOSTS=http://localhost:9200" \
  docker.elastic.co/kibana/kibana:7.13.0

Installing and configuring Logstash is a little more involved as we have to provide the necessary settings and pipeline for data processing. One of the simpler ways to achieve this is by creating a custom image on top of the official image:

FROM docker.elastic.co/logstash/logstash:7.13.0
RUN rm -f /usr/share/logstash/pipeline/logstash.conf
ADD pipeline/ /usr/share/logstash/pipeline/
ADD config/ /usr/share/logstash/config/

Let’s see a sample configuration file for Logstash that integrates with Elasticsearch and Beats:

input {
  tcp {
    port => 4560
    codec => json_lines
  }
  beats {
    host => "127.0.0.1"
    port => "5044"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "app-%{+YYYY.MM.dd}"
    document_type => "%{[@metadata][type]}"
  }
  stdout { codec => rubydebug }
}
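
The index setting in this configuration produces one index per day. As a rough sketch of what that pattern resolves to, the Java snippet below derives the index name from an event timestamp, assuming the date is taken from the event time in UTC. The class and method names are hypothetical. Note that the pattern letters differ between libraries: in java.time, uppercase YYYY means the week-based year, so the sketch uses uuuu for the calendar year.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

final class DailyIndexNameDemo {

    // 'app-' is a literal prefix; uuuu.MM.dd is the calendar date of the event in UTC.
    private static final DateTimeFormatter DAILY_INDEX =
        DateTimeFormatter.ofPattern("'app-'uuuu.MM.dd").withZone(ZoneOffset.UTC);

    // Returns the name of the daily index an event with this timestamp would be written to.
    static String indexFor(Instant eventTime) {
        return DAILY_INDEX.format(eventTime);
    }

    public static void main(String[] args) {
        System.out.println(indexFor(Instant.parse("2021-06-11T23:30:00Z")));
    }
}
```

Daily indices keep each index small and make it easy to drop old log data by deleting whole indices.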

There are several types of Beats available depending upon the data source. For our example, we’ll be using Filebeat. Installing and configuring Beats is best done with the help of a custom image:

FROM docker.elastic.co/beats/filebeat:7.13.0
COPY filebeat.yml /usr/share/filebeat/filebeat.yml
USER root
RUN chown root:filebeat /usr/share/filebeat/filebeat.yml
USER filebeat

Let’s look at a sample filebeat.yml for a Spring Boot application:

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /tmp/math-service.log
output.logstash:
  hosts: ["localhost:5044"]
This is a very cursory but complete explanation of the installation and configuration for the Elastic Stack. It’s beyond the scope of this tutorial to go into all the details.

9.3. Log Analysis

Kibana provides a very intuitive and powerful visualization tool for our logs. We can access the Kibana interface at its default URL, http://localhost:5601. We can select a visualization and create a dashboard for our application.

Let’s see a sample dashboard:

Kibana offers quite extensive capabilities to query and filter the log data. These are beyond the scope of this tutorial.

10. The Future of Observability

Now, we’ve seen why observability is a key concern for distributed systems. We’ve also gone through some of the popular options for handling different types of telemetry data that can enable us to achieve observability. However, the fact remains, it’s still quite complex and time-consuming to assemble all the pieces. We have to handle a lot of different products.

One of the key advancements in this area is OpenTelemetry, a sandbox project in the CNCF. Basically, OpenTelemetry has been formed through a careful merger of OpenTracing and OpenCensus projects. Obviously, this makes sense as we’ll only have to deal with a single abstraction for both traces and metrics.

What’s more, OpenTelemetry has a plan to support logs and make it a complete observability framework for distributed systems. Further, OpenTelemetry has support for several languages and integrates well with popular frameworks and libraries. Also, OpenTelemetry is backward compatible with OpenTracing and OpenCensus via software bridges.

OpenTelemetry is still in progress, and we can expect this to mature in the coming days. Meanwhile, to ease our pain, several observability platforms combine many of the products discussed earlier to offer a seamless experience. For instance, Logz.io combines the power of ELK, Prometheus, and Jaeger to offer a scalable platform as a service.

The observability space is fast maturing with new products coming into the market with innovative solutions. For instance, Micrometer provides a vendor-neutral facade over the instrumentation clients for several monitoring systems. Recently, OpenMetrics has released its specification for creating a de facto standard for transmitting cloud-native metrics at scale.

11. Conclusion

In this tutorial, we went through the basics of observability and its implications in a distributed system. We also implemented some of the popular options today for achieving observability in a simple distributed system.

This allowed us to understand how OpenTracing, OpenCensus, and ELK can help us build an observable software system. Finally, we discussed some of the new developments in this area and how we can expect observability to grow and mature in the future.
