Introduction to Apache Spark – Apache Spark简介

最后修改: 2017年 10月 28日

中文/混合/英文(键盘快捷键:t)

1. Introduction

1.介绍

Apache Spark is an open-source cluster-computing framework. It provides elegant development APIs for Scala, Java, Python, and R that allow developers to execute a variety of data-intensive workloads across diverse data sources including HDFS, Cassandra, HBase, S3 etc.

Apache Spark是一个开源的集群计算框架。它为Scala、Java、Python和R提供了优雅的开发API,使开发人员能够在不同的数据源(包括HDFS、Cassandra、HBase、S3等)上执行各种数据密集型工作负载。

Historically, Hadoop’s MapReduce prooved to be inefficient for some iterative and interactive computing jobs, which eventually led to the development of Spark. With Spark, we can run logic up to two orders of magnitude faster than with Hadoop in memory, or one order of magnitude faster on disk.

从历史上看,Hadoop的MapReduce对于一些迭代和交互式计算工作来说效率很低,这最终导致了Spark的发展。利用Spark,我们在内存中运行逻辑的速度比Hadoop快两个数量级,或在磁盘上快一个数量级

2. Spark Architecture

2.Spark架构

Spark applications run as independent sets of processes on a cluster as described in the below diagram:

如下面所述,Spark应用程序在集群中作为独立的进程组运行。

cluster overview

 

These set of processes are coordinated by the SparkContext object in your main program (called the driver program). SparkContext connects to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications.

这些进程的集合由你的主程序(称为驱动程序)中的SparkContext对象协调。SparkContext连接到几种类型的集群管理器(Spark自己的独立集群管理器、Mesos或YARN),这些集群管理器在各应用程序之间分配资源。

Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application.

一旦连接,Spark就会在集群中的节点上获得执行器,这些执行器是为你的应用程序运行计算和存储数据的进程。

Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.

接下来,它将你的应用程序代码(由传递给SparkContext的JAR或Python文件定义)发送给执行器。最后,SparkContext向执行器发送任务以运行

3. Core Components

3.核心成分

The following diagram gives the clear picture of the different components of Spark:

下面的给出了Spark不同组件的清晰图像。

Components of Spark

 

3.1. Spark Core

3.1.斯帕克核心

Spark Core component is accountable for all the basic I/O functionalities, scheduling and monitoring the jobs on spark clusters, task dispatching, networking with different storage systems, fault recovery, and efficient memory management.

Spark核心组件负责所有基本的I/O功能,调度和监控Spark集群上的作业,任务调度,与不同的存储系统联网,故障恢复,以及有效的内存管理。

Unlike Hadoop, Spark avoids shared data to be stored in intermediate stores like Amazon S3 or HDFS by using a special data structure known as RDD (Resilient Distributed Datasets).

与Hadoop不同,Spark通过使用被称为RDD(弹性分布式数据集)的特殊数据结构,避免了将共享数据存储在Amazon S3或HDFS等中间存储中。

Resilient Distributed Datasets are immutable, a partitioned collection of records that can be operated on – in parallel and allows – fault-tolerant ‘in-memory’ computations.

弹性分布式数据集是不可改变的,是一个分区的记录集合,可以进行操作–并行并允许–容错的 “内存 “计算

RDDs support two kinds of operations:

RDDs支持两种操作。

  • Transformation – Spark RDD transformation is a function that produces new RDD from the existing RDDs. The transformer takes RDD as input and produces one or more RDD as output. Transformations are lazy in nature i.e., they get execute when we call an action
  • Actiontransformations create RDDs from each other, but when we want to work with the actual data set, at that point action is performed. Thus, Actions are Spark RDD operations that give non-RDD values. The values of action are stored to drivers or to the external storage system

An action is one of the ways of sending data from Executor to the driver.

动作是将数据从Executor发送到驱动程序的方式之一。

Executors are agents that are responsible for executing a task. While the driver is a JVM process that coordinates workers and execution of the task. Some of the actions of Spark are count and collect.

执行者是负责执行任务的代理。而驱动器是一个JVM进程,负责协调工人和任务的执行。Spark的一些动作是计数和收集。

3.2. Spark SQL

3.2.Spark SQL

Spark SQL is a Spark module for structured data processing. It’s primarily used to execute SQL queries. DataFrame constitutes the main abstraction for Spark SQL. Distributed collection of data ordered into named columns is known as a DataFrame in Spark.

Spark SQL是一个用于结构化数据处理的Spark模块。它主要用于执行SQL查询。DataFrame构成了Spark SQL的主要抽象概念。在Spark中,分布式的数据集合被命名为DataFrame

Spark SQL supports fetching data from different sources like Hive, Avro, Parquet, ORC, JSON, and JDBC. It also scales to thousands of nodes and multi-hour queries using the Spark engine – which provides full mid-query fault tolerance.

Spark SQL支持从Hive、Avro、Parquet、ORC、JSON和JDBC等不同来源获取数据。它还可以使用Spark引擎扩展到数以千计的节点和多小时的查询–它提供了完整的查询中期容错。

3.3. Spark Streaming

3.3.星火流媒体

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from a number of sources, such as Kafka, Flume, Kinesis, or TCP sockets.

Spark Streaming是核心Spark API的一个扩展,它能够对实时数据流进行可扩展、高吞吐量、容错的流处理。数据可以从多个来源摄取,如Kafka、Flume、Kinesis或TCP套接字。

Finally, processed data can be pushed out to file systems, databases, and live dashboards.

最后,经过处理的数据可以被推送到文件系统、数据库和实时仪表盘上。

3.4. Spark Mlib

3.4.Spark Mlib

MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as:

MLlib是Spark的机器学习(ML)库。它的目标是使实际的机器学习变得可扩展和容易。在一个较高的水平上,它提供了一些工具,如。

  • ML Algorithms – common learning algorithms such as classification, regression, clustering, and collaborative filtering
  • Featurization – feature extraction, transformation, dimensionality reduction, and selection
  • Pipelines – tools for constructing, evaluating, and tuning ML Pipelines
  • Persistence – saving and load algorithms, models, and Pipelines
  • Utilities – linear algebra, statistics, data handling, etc.

3.5. Spark GraphX

3.5.Spark GraphX

GraphX is a component for graphs and graph-parallel computations. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge.

GraphX是一个用于图和图并行计算的组件。在高层次上,GraphX通过引入一个新的图抽象来扩展Spark RDD:一个有向多图,每个顶点和边都附有属性。

To support graph computation, GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and aggregateMessages).

为了支持图形计算,GraphX公开了一组基本操作符(例如,subgraphjoinVerticesaggregateMessages)。

In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.

此外,GraphX包括一个不断增长的图算法和构建器集合,以简化图分析任务。

4. “Hello World” in Spark

4.Spark中的 “Hello World”

Now that we understand the core components, we can move on to simple Maven-based Spark project – for calculating word counts.

现在我们了解了核心组件,我们可以继续研究简单的基于Maven的Spark项目–用于计算字数

We’ll be demonstrating Spark running in the local mode where all the components are running locally on the same machine where it’s the master node, executor nodes or Spark’s standalone cluster manager.

我们将演示Spark在本地模式下的运行,所有的组件都在同一台机器上运行,其中有主节点、执行节点或Spark的独立集群管理器。

4.1. Maven Setup

4.1.Maven设置

Let’s set up a Java Maven project with Spark-related dependencies in pom.xml file:

让我们在pom.xml文件中设置一个带有Spark相关依赖的Java Maven项目。

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
	<artifactId>spark-core_2.10</artifactId>
	<version>1.6.0</version>
    </dependency>
</dependencies>

4.2. Word Count – Spark Job

4.2.字数–火花作业

Let’s now write Spark job to process a file containing sentences and output distinct words and their counts in the file:

现在让我们写Spark作业来处理一个包含句子的文件,并输出文件中的不同词汇及其数量。

public static void main(String[] args) throws Exception {
    if (args.length < 1) {
        System.err.println("Usage: JavaWordCount <file>");
        System.exit(1);
    }
    SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    JavaRDD<String> lines = ctx.textFile(args[0], 1);

    JavaRDD<String> words 
      = lines.flatMap(s -> Arrays.asList(SPACE.split(s)).iterator());
    JavaPairRDD<String, Integer> ones 
      = words.mapToPair(word -> new Tuple2<>(word, 1));
    JavaPairRDD<String, Integer> counts 
      = ones.reduceByKey((Integer i1, Integer i2) -> i1 + i2);

    List<Tuple2<String, Integer>> output = counts.collect();
    for (Tuple2<?, ?> tuple : output) {
        System.out.println(tuple._1() + ": " + tuple._2());
    }
    ctx.stop();
}

Notice that we pass the path of the local text file as an argument to a Spark job.

注意,我们把本地文本文件的路径作为参数传递给Spark作业。

A SparkContext object is the main entry point for Spark and represents the connection to an already running Spark cluster. It uses SparkConf object for describing the application configuration. SparkContext is used to read a text file in memory as a JavaRDD object.

SparkContext对象是Spark的主要入口,代表与已经运行的Spark集群的连接。它使用SparkConf对象来描述应用程序的配置。SparkContext用于读取内存中的文本文件作为JavaRDD对象。

Next, we transform the lines JavaRDD object to words JavaRDD object using the flatmap method to first convert each line to space-separated words and then flatten the output of each line processing.

接下来,我们使用flatmap方法将行JavaRDD对象转换为字JavaRDD对象,首先将每一行转换为空格分隔的字,然后对每一行的输出进行扁平化处理。

We again apply transform operation mapToPair which basically maps each occurrence of the word to the tuple of words and count of 1.

我们再次应用转换操作mapToPair ,该操作基本上将每个出现的单词映射到单词和计数为1的元组中。

Then, we apply the reduceByKey operation to group multiple occurrences of any word with count 1 to a tuple of words and summed up the count.

然后,我们应用reduceByKey操作,将多次出现的任何计数为1的词分组为一个词组,并将计数相加。

Lastly, we execute collect RDD action to get the final results.

最后,我们执行collect RDD动作来获得最终结果。

4.3. Executing – Spark Job

4.3.执行 – Spark作业

Let’s now build the project using Maven to generate apache-spark-1.0-SNAPSHOT.jar in the target folder.

现在让我们用Maven构建该项目,在目标文件夹中生成apache-spark-1.0-SNAPSHOT.jar

Next, we need to submit this WordCount job to Spark:

接下来,我们需要将这个WordCount作业提交给Spark。

${spark-install-dir}/bin/spark-submit --class com.baeldung.WordCount 
  --master local ${WordCount-MavenProject}/target/apache-spark-1.0-SNAPSHOT.jar
  ${WordCount-MavenProject}/src/main/resources/spark_example.txt

Spark installation directory and WordCount Maven project directory needs to be updated before running above command.

在运行上述命令之前,需要更新Spark安装目录和WordCount Maven项目目录。

On submission couple of steps happens behind the scenes:

在提交的几个步骤中,背后发生了一些事情。

  1. From the driver code, SparkContext connects to cluster manager(in our case spark standalone cluster manager running locally)
  2. Cluster Manager allocates resources across the other applications
  3. Spark acquires executors on nodes in the cluster. Here, our word count application will get its own executor processes
  4. Application code (jar files) is sent to executors
  5. Tasks are sent by the SparkContext to the executors.

Finally, the result of spark job is returned to the driver and we will see the count of words in the file as the output:

最后,Spark作业的结果被返回给驱动程序,我们将看到文件中的字数作为输出。

Hello 1
from 2
Baledung 2
Keep 1
Learning 1
Spark 1
Bye 1

5. Conclusion

5.结论

In this article, we discussed the architecture and different components of Apache Spark. We also demonstrated a working example of a Spark job giving word counts from a file.

在这篇文章中,我们讨论了Apache Spark的架构和不同组件。我们还演示了一个Spark工作的例子,即从一个文件中提供字数。

As always, the full source code is available over on GitHub.

一如既往,完整的源代码可在GitHub上获得