Introduction to Spark Graph Processing with GraphFrames – 使用GraphFrames的Spark图形处理简介

最后修改: 2019年 12月 1日

中文/混合/英文(键盘快捷键:t)

1. Introduction

1.绪论

Graph processing is useful for many applications from social networks to advertisements. Inside a big data scenario, we need a tool to distribute that processing load.

图形处理对于从社交网络到广告的许多应用都很有用。在大数据场景中,我们需要一个工具来分配该处理负载。

In this tutorial, we’ll load and explore graph possibilities using Apache Spark in Java. To avoid complex structures, we’ll be using an easy and high-level Apache Spark graph API: the GraphFrames API.

在本教程中,我们将在Java中使用Apache Spark加载并探索图的可能性。为了避免复杂的结构,我们将使用一个简单而高级的Apache Spark图API:GraphFrames API。

2. Graphs

2. 图形

First of all, let’s define a graph and its components. A graph is a data structure having edges and vertices. The edges carry information that represents relationships between the vertices.

首先,我们来定义一个图和它的组成部分。图是一个具有边和顶点的数据结构。边携带的信息代表顶点之间的关系。

The vertices are points in an n-dimensional space, and edges connect the vertices according to their relationships:

顶点是n维空间中的点,而边则根据它们的关系连接顶点。

In the image above, we have a social network example. We can see the vertices represented by letters and the edges carrying which kind of relationship is between the vertices.

在上面的图片中,我们有一个社会网络的例子。我们可以看到用字母表示的顶点和承载顶点之间关系的边。

3. Maven Setup

3.Maven设置

Now, let’s start the project by setting up the Maven configuration.

现在,让我们通过设置Maven配置来启动该项目。

Let’s add spark-graphx 2.11, graphframes, and spark-sql 2.11:

让我们添加spark-graphx 2.11,graphframes, 和spark-sql 2.11

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-graphx_2.11</artifactId>
    <version>2.4.4</version>
</dependency>
<dependency>
   <groupId>graphframes</groupId>
   <artifactId>graphframes</artifactId>
   <version>0.7.0-spark2.4-s_2.11</version>
</dependency>
<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-sql_2.11</artifactId>
   <version>2.4.4</version>
</dependency>

These artifact versions support Scala 2.11.

这些工件版本支持Scala 2.11。

Also, it so happens that GraphFrames is not in Maven Central. So, let’s add the needed Maven repository, too:

另外,GraphFrames恰好不在Maven中心。因此,我们也来添加所需的Maven仓库。

<repositories>
     <repository>
          <id>SparkPackagesRepo</id>
          <url>http://dl.bintray.com/spark-packages/maven</url>
     </repository>
</repositories>

4. Spark Configuration

4.星火配置

In order to work with GraphFrames, we’ll need to download Hadoop and define the HADOOP_HOME environment variable.

为了使用 GraphFrames,我们需要下载Hadoop并定义HADOOP_HOME环境变量。

In the case of Windows as the operating system, we’ll also download the appropriate winutils.exe to the HADOOP_HOME/bin folder.

在Windows作为操作系统的情况下,我们还将下载相应的winutils.exeHADOOP_HOME/bin文件夹。

Next, let’s begin our code by creating the basic configuration:

接下来,让我们通过创建基本配置来开始我们的代码。

SparkConf sparkConf = new SparkConf()
  .setAppName("SparkGraphFrames")
  .setMaster("local[*]");
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);

We’ll also need to create a SparkSession:

我们还需要创建一个SparkSession

SparkSession session = SparkSession.builder()
  .appName("SparkGraphFrameSample")
  .config("spark.sql.warehouse.dir", "/file:C:/temp")
  .sparkContext(javaSparkContext.sc())
  .master("local[*]")
  .getOrCreate();

5. Graph Construction

5.图形构造

Now, we’re all set to start with our main code. So, let’s define the entities for our vertices and edges, and create the GraphFrame instance.

现在,我们已经准备好开始编写我们的主代码了。所以,让我们为我们的顶点和边定义实体,并创建GraphFrame实例。

We’ll work on the relationships between users from a hypothetical social network.

我们将从一个假设的社会网络中研究用户之间的关系。

5.1. Data

5.1 数据

First, for this example, let’s define both entities as User and Relationship:

首先,对于这个例子,让我们把两个实体定义为UserRelationship

public class User {
    private Long id;
    private String name;
    // constructor, getters and setters
}
 
public class Relationship implements Serializable {
    private String type;
    private String src;
    private String dst;
    private UUID id;

    public Relationship(String type, String src, String dst) {
        this.type = type;
        this.src = src;
        this.dst = dst;
        this.id = UUID.randomUUID();
    }
    // getters and setters
}

Next, let’s define some User and Relationship instances:

接下来,让我们定义一些UserRelationship实例。

List<User> users = new ArrayList<>();
users.add(new User(1L, "John"));
users.add(new User(2L, "Martin"));
users.add(new User(3L, "Peter"));
users.add(new User(4L, "Alicia"));

List<Relationship> relationships = new ArrayList<>();
relationships.add(new Relationship("Friend", "1", "2"));
relationships.add(new Relationship("Following", "1", "4"));
relationships.add(new Relationship("Friend", "2", "4"));
relationships.add(new Relationship("Relative", "3", "1"));
relationships.add(new Relationship("Relative", "3", "4"));

5.2. GraphFrame Instance

5.2 GraphFrame实例

Now, in order to create and manipulate our graph of relationships, we’ll create an instance of GraphFrame. The GraphFrame constructor expects two Dataset<Row> instances, the first representing the vertices and the second, the edges:

现在,为了创建和操作我们的关系图,我们将创建一个GraphFrame的实例。GraphFrame构造函数期望有两个Dataset<Row>实例,第一个代表顶点,第二个代表边线。

Dataset<Row> userDataset = session.createDataFrame(users, User.class);
Dataset<Row> relationshipDataset = session.createDataFrame(relationships, Relation.class);

GraphFrame graph = new GraphFrame(userDataframe, relationshipDataframe);

At last, we’ll log our vertices and edges in the console to see how it looks:

最后,我们将在控制台中记录我们的顶点和边缘,看看它看起来如何。

graph.vertices().show();
graph.edges().show();
+---+------+
| id|  name|
+---+------+
|  1|  John|
|  2|Martin|
|  3| Peter|
|  4|Alicia|
+---+------+

+---+--------------------+---+---------+
|dst|                  id|src|     type|
+---+--------------------+---+---------+
|  2|622da83f-fb18-484...|  1|   Friend|
|  4|c6dde409-c89d-490...|  1|Following|
|  4|360d06e1-4e9b-4ec...|  2|   Friend|
|  1|de5e738e-c958-4e0...|  3| Relative|
|  4|d96b045a-6320-4a6...|  3| Relative|
+---+--------------------+---+---------+

6. Graph Operators

6.图形运算符

Now that we have a GraphFrame instance, let’s see what we can do with it.

现在我们有了一个GraphFrame实例,让我们看看我们能用它做什么。

6.1. Filter

6.1 过滤器

GraphFrames allows us to filter edges and vertices by a query.

GraphFrames允许我们通过查询来过滤边缘和顶点。

Next, then, let’s filter the vertices by the name property on User:

接下来,让我们通过User上的name属性过滤顶点。

graph.vertices().filter("name = 'Martin'").show();

At the console, we can see the result:

在控制台,我们可以看到结果。

+---+------+
| id|  name|
+---+------+
|  2|Martin|
+---+------+

Also, we can directly filter on the graph by calling filterEdges or filterVertices:

此外,我们还可以通过调用filterEdgesfilterVertices直接对图形进行过滤。

graph.filterEdges("type = 'Friend'")
  .dropIsolatedVertices().vertices().show();

Now, since we filtered the edges, we might still have some isolated vertices. So, we’ll call dropIsolatedVertices(). 

现在,由于我们过滤了边缘,我们可能仍然有一些孤立的顶点。因此,我们将调用dropIsolatedVertices()。

As a result, we have a subgraph, still a GraphFrame instance, with just the relationships that have “Friend” status:

因此,我们有一个子图,仍然是一个GraphFrame实例,其中只有具有 “朋友 “状态的关系。

+---+------+
| id|  name|
+---+------+
|  1|  John|
|  2|Martin|
|  4|Alicia|
+---+------+

6.2. Degrees

6.2.学位

Another interesting feature set is the degrees set of operations. These operations return the number of edges incident on each vertex.

另一个有趣的特征集是degrees操作集。这些操作返回每个顶点上incident的边的数量。

The degrees operation just returns the count of all edges of each vertex. On the other hand, inDegrees counts only incoming edges, and outDegrees counts only outgoing edges.

degrees操作只是返回每个顶点的所有边的计数。另一方面,inDegrees只计算入射边,outDegrees只计算出射边。

Let’s count the incoming degrees of all vertices in our graph:

让我们计算一下我们图中所有顶点的入射度。

graph.inDegrees().show();

As a result, we have a GraphFrame that shows the number of incoming edges to each vertex, excluding those with none:

因此,我们有一个GraphFrame,它显示了到每个顶点的传入边的数量,不包括那些没有的边。

+---+--------+
| id|inDegree|
+---+--------+
|  1|       1|
|  4|       3|
|  2|       1|
+---+--------+

7. Graph Algorithms

7.图形算法

GraphFrames also provides popular algorithms ready to use — let’s take a look at some of them.

GraphFrames还提供了可随时使用的流行算法–让我们看看其中一些算法。

7.1. Page Rank

7.1 网页排名

The Page Rank algorithm weighs the incoming edges to a vertex and transforms it into a score.

页面排名算法对进入一个顶点的边进行权衡,并将其转化为分数。

The idea is that each incoming edge represents an endorsement and makes the vertex more relevant in the given graph.

这个想法是,每条传入的边都代表一种认可,并使顶点在给定的图中更加相关。

For example, in a social network, if a person is followed by various people, he or she will be ranked highly.

例如,在社交网络中,如果一个人被不同的人关注,他或她的排名就会很高。

Running the page rank algorithm is quite straightforward:

运行页面排名算法是非常简单的。

graph.pageRank()
  .maxIter(20)
  .resetProbability(0.15)
  .run()
  .vertices()
  .show();

To configure this algorithm, we just need to provide:

要配置这种算法,我们只需要提供。

  • maxIter – the number of iterations of page rank to run – 20 is recommended, too few will decrease the quality, and too many will degrade the performance
  • resetProbability – the random reset probability (alpha) – the lower it is, the bigger the score spread between the winners and losers will be – valid ranges are from 0 to 1. Usually, 0.15 is a good score

The response is a similar GraphFrame, though this time we see an additional column giving the page rank of each vertex:

响应是一个类似的GraphFrame,尽管这次我们看到一个额外的列,给出了每个顶点的页面排名。

+---+------+------------------+
| id|  name|          pagerank|
+---+------+------------------+
|  4|Alicia|1.9393230468864597|
|  3| Peter|0.4848822786454427|
|  1|  John|0.7272991738542318|
|  2|Martin| 0.848495500613866|
+---+------+------------------+

In our graph, Alicia is the most relevant vertex, followed by Martin and John.

在我们的图中,艾丽西亚是最相关的顶点,其次是马丁和约翰。

7.2. Connected Components

7.2.连接的组件

The connected components algorithm finds isolated clusters or isolated sub-graphs. These clusters are sets of connected vertices in a graph where each vertex is reachable from any other vertex in the same set.

连接部件算法可以找到孤立的集群或孤立的子图。这些集群是图形中连接顶点的集合,其中每个顶点都可以从同一集合中的任何其他顶点到达。

We can call the algorithm without any parameters via the connectedComponents() method:

我们可以通过connectedComponents()方法调用该算法,而无需任何参数。

graph.connectedComponents().run().show();

The algorithm returns a GraphFrame containing each vertex and the component to which each is connected:

该算法返回一个GraphFrame,包含每个顶点和每个顶点所连接的组件。

+---+------+------------+
| id|  name|   component|
+---+------+------------+
|  1|  John|154618822656|
|  2|Martin|154618822656|
|  3| Peter|154618822656|
|  4|Alicia|154618822656|
+---+------+------------+

Our graph has only one component — this means that we do not have isolated sub-graphs. The component has an auto-generated id, which is 154618822656, in our case.

我们的图只有一个组件–这意味着我们没有孤立的子图。该组件有一个自动生成的ID,在我们的例子中,它是154618822656。

Although we have one more column here – the component id – our graph is still the same.

虽然我们在这里多了一列–组件ID–但我们的图仍然是一样的。

7.3. Triangle Counting

7.3.三角形计数

Triangle counting is commonly used as community detection and counting in a social network graph. A triangle is a set of three vertices, where each vertex has a relationship to the other two vertices in the triangle.

三角形计数通常被用作社会网络图中的社区检测和计数。三角形是一个由三个顶点组成的集合,其中每个顶点与三角形中的其他两个顶点有关系。

In a social network community, it’s easy to find a considerable number of triangles connected to each other.

在一个社会网络社区中,很容易发现有相当数量的三角形相互连接。

We can easily perform a triangle counting directly from our GraphFrame instance:

我们可以很容易地直接从我们的GraphFrame实例中进行三角形计数。

graph.triangleCount().run().show();

The algorithm also returns a GraphFrame with the number of triangles passing through each vertex.

该算法还返回一个GraphFrame,其中有经过每个顶点的三角形数量。

+-----+---+------+
|count| id|  name|
+-----+---+------+
|    1|  3| Peter|
|    2|  1|  John|
|    2|  4|Alicia|
|    1|  2|Martin|
+-----+---+------+

8. Conclusion

8.结语

Apache Spark is a great tool for computing a relevant amount of data in an optimized and distributed way. And, the GraphFrames library allows us to easily distribute graph operations over Spark.

Apache Spark是一个伟大的工具,可以以优化和分布式的方式计算相关的数据量。而且,GraphFrames库允许我们在Spark上轻松地分布图操作

As always, the complete source code for the example is available over on GitHub.

一如既往,该示例的完整源代码可在GitHub上获得over