1. Overview
In this tutorial, we’ll introduce Apache Beam and explore its fundamental concepts.
We’ll start by demonstrating the use case and benefits of using Apache Beam, and then we’ll cover foundational concepts and terminology. Afterward, we’ll walk through a simple example that illustrates all the important aspects of Apache Beam.
2. What Is Apache Beam?
Apache Beam (Batch + strEAM) is a unified programming model for batch and streaming data processing jobs. It provides a software development kit to define and construct data processing pipelines as well as runners to execute them.
Apache Beam is designed to provide a portable programming layer. In fact, the Beam Pipeline Runners translate the data processing pipeline into an API compatible with the backend of the user’s choice. Currently, these distributed processing backends are supported:
- Apache Apex
- Apache Flink
- Apache Gearpump (incubating)
- Apache Samza
- Apache Spark
- Google Cloud Dataflow
- Hazelcast Jet
3. Why Apache Beam?
Apache Beam fuses batch and streaming data processing, while other frameworks often handle them through separate APIs. Consequently, it’s very easy to change a streaming process to a batch process and vice versa as requirements change.
Apache Beam increases portability and flexibility. We focus on our logic rather than the underlying details. Moreover, we can change the data processing backend at any time.
There are Java, Python, Go, and Scala SDKs available for Apache Beam. Indeed, everybody on the team can use it with their language of choice.
4. Fundamental Concepts
With Apache Beam, we can construct workflow graphs (pipelines) and execute them. The key concepts in the programming model are:
- PCollection – represents a data set which can be a fixed batch or a stream of data
- PTransform – a data processing operation that takes one or more PCollections and outputs zero or more PCollections
- Pipeline – represents a directed acyclic graph of PCollection and PTransform, and hence, encapsulates the entire data processing job
- PipelineRunner – executes a Pipeline on a specified distributed processing backend
Simply put, a PipelineRunner executes a Pipeline, and a Pipeline consists of PCollection and PTransform.
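To make these relationships concrete, here’s a minimal sketch wiring the pieces together (imports, an input file, and a runner on the classpath are assumed; we’ll build a complete example in the next sections):
Pipeline pipeline = Pipeline.create(); // the workflow graph
PCollection<String> lines = pipeline
    .apply(TextIO.read().from("input.txt")); // a PTransform produces a PCollection
pipeline.run().waitUntilFinish(); // a PipelineRunner executes the graph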
5. Word Count Example
Now that we’ve learned the basic concepts of Apache Beam, let’s design and test a word count task.
5.1. Constructing a Beam Pipeline
Designing the workflow graph is the first step in every Apache Beam job. Let’s define the steps of a word count task:
- Read the text from a source.
- Split the text into a list of words.
- Lowercase all words.
- Trim punctuation.
- Filter stopwords.
- Count each unique word.
To achieve this, we’ll need to convert the above steps into a single Pipeline using PCollection and PTransform abstractions.
5.2. Dependencies
Before we can implement our workflow graph, we should add Apache Beam’s core dependency to our project:
<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-core</artifactId>
    <version>${beam.version}</version>
</dependency>
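Note that ${beam.version} is a Maven property we define ourselves, typically in the properties section of the pom.xml. The version below is just an illustration; any recent Beam release should work:
<properties>
    <beam.version>2.46.0</beam.version>
</properties>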
Beam Pipeline Runners rely on a distributed processing backend to perform tasks. Let’s add DirectRunner as a runtime dependency:
<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-runners-direct-java</artifactId>
    <version>${beam.version}</version>
    <scope>runtime</scope>
</dependency>
Unlike other Pipeline Runners, DirectRunner doesn’t need any additional setup, which makes it a good choice for starters.
5.3. Implementation
Apache Beam utilizes the Map-Reduce programming paradigm (the same one behind Java Streams). In fact, it’s a good idea to have a basic understanding of reduce(), filter(), count(), map(), and flatMap() before we continue.
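For intuition, here’s roughly the same shape of operations expressed with plain Java Streams. This is only an in-memory, single-machine analogy, not part of the Beam pipeline:
// in-memory analogy of the word count steps (illustrative only)
List<String> lines = List.of("Apache Beam rocks", "Beam is unified");
Map<String, Long> counts = lines.stream()
    .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s")))
    .collect(Collectors.groupingBy(word -> word, Collectors.counting()));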
Creating a Pipeline is the first thing we do:
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
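The default options are enough for DirectRunner. Still, it’s worth knowing that we could also parse options from command-line arguments or select a runner explicitly; a sketch, assuming an args array is in scope:
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
options.setRunner(DirectRunner.class); // DirectRunner is also the default when none is set
Pipeline p = Pipeline.create(options);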
Now we apply our six-step word count task:
PCollection<KV<String, Long>> wordCount = p
    .apply("(1) Read all lines",
        TextIO.read().from(inputFilePath))
    .apply("(2) Flatmap to a list of words",
        FlatMapElements.into(TypeDescriptors.strings())
            .via(line -> Arrays.asList(line.split("\\s"))))
    .apply("(3) Lowercase all",
        MapElements.into(TypeDescriptors.strings())
            .via(word -> word.toLowerCase()))
    .apply("(4) Trim punctuations",
        MapElements.into(TypeDescriptors.strings())
            .via(word -> trim(word)))
    .apply("(5) Filter stopwords",
        Filter.by(word -> !isStopWord(word)))
    .apply("(6) Count words",
        Count.perElement());
The first (optional) argument of apply() is a String that serves only to make the code more readable. Here is what each apply() does in the above code:
- First, we read an input text file line by line using TextIO.
- Splitting each line by whitespace, we flat-map it to a list of words.
- Word count is case-insensitive, so we lowercase all words.
- Earlier, we split lines by whitespace, ending up with words like “word!” and “word?”, so we remove punctuation with a small trim() helper (sketched after this list).
- Stopwords such as “is” and “by” are frequent in almost every English text, so we remove them.
- Finally, we count unique words using the built-in function Count.perElement().
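Note that trim() and isStopWord() in steps (4) and (5) aren’t Beam built-ins but helper methods we define ourselves. A minimal sketch could look like this (the regex and the stopword set are illustrative assumptions):
static String trim(String word) {
    // strip leading and trailing non-word characters, e.g. "word!" -> "word"
    return word.replaceAll("^\\W+|\\W+$", "");
}

static boolean isStopWord(String word) {
    // a tiny illustrative stopword set; a real list would be much longer
    return Set.of("is", "by", "a", "an", "the", "are").contains(word);
}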
As mentioned earlier, pipelines are processed on a distributed backend. It’s not possible to iterate over a PCollection in-memory since it’s distributed across multiple backends. Instead, we write the results to an external database or file.
First, we map our key-value pairs to Strings. Then, we use TextIO to write the output:
wordCount.apply(MapElements.into(TypeDescriptors.strings())
        .via(count -> count.getKey() + " --> " + count.getValue()))
    .apply(TextIO.write().to(outputFilePath));
Now that our Pipeline definition is complete, we can run and test it.
5.4. Running and Testing
So far, we’ve defined a Pipeline for the word count task. At this point, let’s run the Pipeline:
p.run().waitUntilFinish();
On this line of code, Apache Beam will send our task to multiple DirectRunner instances. Consequently, several output files will be generated at the end. They’ll contain things like:
...
apache --> 3
beam --> 5
rocks --> 2
...
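As a side note, TextIO shards its output across several files by default. If a single file is easier to inspect while experimenting, we can limit the sharding at the cost of write parallelism:
wordCount.apply(MapElements.into(TypeDescriptors.strings())
        .via(count -> count.getKey() + " --> " + count.getValue()))
    .apply(TextIO.write().to(outputFilePath)
        .withNumShards(1)); // one shard, one output file (less parallel)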
Defining and running a distributed job in Apache Beam is as simple and expressive as this. For comparison, word count implementations are also available on Apache Spark, Apache Flink, and Hazelcast Jet.
6. Where Do We Go From Here?
We successfully counted each word from our input file, but we don’t have a report of the most frequent words yet. Certainly, sorting a PCollection is a good problem to solve as our next step.
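As a taste of that next step, Beam ships a built-in Top transform; here’s a sketch that picks the ten most frequent words from our wordCount PCollection (the cutoff of 10 is arbitrary):
PCollection<List<KV<String, Long>>> topWords = wordCount
    .apply(Top.of(10, new SerializableComparator<KV<String, Long>>() {
        @Override
        public int compare(KV<String, Long> a, KV<String, Long> b) {
            return Long.compare(a.getValue(), b.getValue()); // order by frequency
        }
    }));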
Later, we can learn more about Windowing, Triggers, Metrics, and more sophisticated Transforms. Apache Beam Documentation provides in-depth information and reference material.
7. Conclusion
In this tutorial, we learned what Apache Beam is and why it’s preferred over alternatives. We also demonstrated basic concepts of Apache Beam with a word count example.
The code for this tutorial is available over on GitHub.