1. Overview
In this tutorial, we’ll take a dive into the MongoDB Aggregation framework using the MongoDB Java driver.
We’ll first look at what aggregation means conceptually, and then set up a dataset. Finally, we’ll see various aggregation techniques in action using the Aggregates builder.
2. What Are Aggregations?
Aggregations are used in MongoDB to analyze data and derive meaningful information out of it.
An aggregation usually runs in several stages, and the stages form a pipeline, such that the output of one stage is passed on as the input to the next stage.
The most commonly used stages can be summarized as:
Stage | SQL Equivalent | Description |
---|---|---|
project | SELECT | selects only the required fields; can also be used to compute and add derived fields to the output documents |
match | WHERE | filters the collection as per specified criteria |
group | GROUP BY | groups input documents by a specified key and applies accumulators (e.g. count, sum) to return one document for each distinct group |
sort | ORDER BY | sorts the results in ascending or descending order of a given field |
count | COUNT | counts the documents that reach this stage |
limit | LIMIT | limits the result to a specified number of documents, instead of returning the entire collection |
out | SELECT INTO NEW_TABLE | writes the result to a named collection; this stage is only acceptable as the last in a pipeline |
The SQL equivalent of each aggregation stage is included above to give us an idea of what the operation means in the SQL world.
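To make the pipeline idea concrete without a database, the following plain-Java sketch models each stage as a function from a list of documents to a list of documents and runs two of them in sequence. It is only an analogy built on the standard library (Java 9+); the class PipelineSketch, its helper methods, and the three sample records are assumptions made for illustration and are not part of the MongoDB driver.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative analogy only: models aggregation stages as functions over in-memory "documents".
class PipelineSketch {

    // A stage consumes the previous stage's output and produces the input for the next one.
    interface Stage extends Function<List<Map<String, Object>>, List<Map<String, Object>>> {}

    // Rough counterpart of a match stage: keep documents whose field equals the given value.
    static Stage match(String field, Object value) {
        return docs -> docs.stream()
            .filter(doc -> value.equals(doc.get(field)))
            .collect(Collectors.toList());
    }

    // Rough counterpart of a count stage: emit one document holding the number of inputs,
    // or nothing at all when there are no inputs.
    static Stage count() {
        return docs -> {
            if (docs.isEmpty()) {
                return List.of();
            }
            Map<String, Object> result = Map.of("count", docs.size());
            return List.of(result);
        };
    }

    // Feeds the output of each stage into the next one, in order.
    static List<Map<String, Object>> run(List<Map<String, Object>> input, List<Stage> stages) {
        List<Map<String, Object>> current = input;
        for (Stage stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    // Hypothetical sample records, not taken from the tutorial's dataset.
    static List<Map<String, Object>> sampleCountries() {
        return List.of(
            Map.<String, Object>of("name", "India", "region", "Asia"),
            Map.<String, Object>of("name", "Japan", "region", "Asia"),
            Map.<String, Object>of("name", "France", "region", "Europe"));
    }

    public static void main(String[] args) {
        List<Map<String, Object>> result =
            run(sampleCountries(), List.of(match("region", "Asia"), count()));
        System.out.println(result);
    }
}
```

Running main chains match and count over the three hypothetical records and yields a single document whose count is 2.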
We’ll look at Java code samples for all of these stages shortly. But before that, we need a database.
3. Database Setup
3.1. Dataset
The first and foremost requirement for learning anything database-related is the dataset itself!
For the purpose of this tutorial, we’ll use a publicly available RESTful API endpoint that provides comprehensive information about all the countries of the world. This API gives us a lot of data points for each country in a convenient JSON format. Some of the fields that we’ll be using in our analysis are:
- name – the name of the country; for example, United States of America
- alpha3Code – a shortcode for the country name; for example, IND (for India)
- region – the region the country belongs to; for example, Europe
- area – the geographical area of the country
- languages – official languages of the country in an array format; for example, English
- borders – an array of neighboring countries’ alpha3Codes
Now let’s see how to convert this data into a collection in a MongoDB database.
3.2. Importing to MongoDB
First, we need to hit the API endpoint to get all countries and save the response locally in a JSON file. The next step is to import it into MongoDB using the mongoimport command:
mongoimport.exe --db <db_name> --collection <collection_name> --file <path_to_file> --jsonArray
A successful import should give us a collection with 250 documents.
4. Aggregation Samples in Java
Now that we have the bases covered, let’s get into deriving some meaningful insights from the data we have for all the countries. We’ll use several JUnit tests for this purpose.
But before we do that, we need to make a connection to the database:
@BeforeClass
public static void setUpDB() throws IOException {
    // With no arguments, MongoClients.create() connects to a MongoDB instance on localhost:27017
    mongoClient = MongoClients.create();
    database = mongoClient.getDatabase(DATABASE);
    collection = database.getCollection(COLLECTION);
}
In all the examples that follow, we’ll be using the Aggregates helper class provided by the MongoDB Java driver.
For better readability of our snippets, we can add a static import:
import static com.mongodb.client.model.Aggregates.*;
4.1. match and count
Let’s start with something simple. Earlier we noted that the dataset contains information about languages.
Now, let’s say we want to check the number of countries in the world where English is an official language:
@Test
public void givenCountryCollection_whenEnglishSpeakingCountriesCounted_thenNinetyOne() {
    Document englishSpeakingCountries = collection.aggregate(Arrays.asList(
      match(Filters.eq("languages.name", "English")),
      count())).first();

    assertEquals(91, englishSpeakingCountries.get("count"));
}
Here we are using two stages in our aggregation pipeline: match and count.
First, we filter the collection to keep only those documents that contain English in their languages field. These documents can be imagined as a temporary or intermediate collection that becomes the input for our next stage, count, which counts the documents it receives from the previous stage.
Another point to note in this sample is the use of the method first. Since the output of the last stage, count, is at most a single document, first is a convenient way to extract it. Note that if no document matches, count emits nothing and first returns null.
4.2. group (with sum) and sort
In this example, our objective is to find out the geographical region containing the maximum number of countries:
@Test
public void givenCountryCollection_whenCountedRegionWise_thenMaxInAfrica() {
    Document maxCountriedRegion = collection.aggregate(Arrays.asList(
      group("$region", Accumulators.sum("tally", 1)),
      sort(Sorts.descending("tally")))).first();

    assertTrue(maxCountriedRegion.containsValue("Africa"));
}
As is evident, we are using group and sort to achieve our objective here.
First, we count the countries in each region by accumulating a sum of 1 per document into a field named tally. This gives us an intermediate collection of documents, each containing two fields: _id, which holds the region, and tally, the number of countries in it. Then we sort these documents in descending order of tally and take the first one, which gives us the region with the most countries.
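As a rough in-memory analogy of this group-then-sort step, the sketch below counts occurrences per region with java.util.stream and emits one document per region, shaped like the stage output with the grouping key under _id. The class GroupAndSortSketch, its method, and the sample region list are hypothetical and only illustrate the idea; they do not use the MongoDB driver or the tutorial's dataset.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative analogy only: group by a key with a running count, then sort descending.
class GroupAndSortSketch {

    // Returns one {_id, tally} document per distinct region, largest tally first.
    static List<Map<String, Object>> tallyByRegion(List<String> regions) {
        // Counterpart of group with a sum of 1 per input document.
        Map<String, Long> counts = regions.stream()
            .collect(Collectors.groupingBy(region -> region, Collectors.counting()));

        // Counterpart of sort in descending order of tally.
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
            .map(entry -> {
                Map<String, Object> doc = new LinkedHashMap<>();
                doc.put("_id", entry.getKey());
                doc.put("tally", entry.getValue());
                return doc;
            })
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical input: one region value per country document.
        List<String> regions = List.of("Africa", "Europe", "Africa", "Asia", "Africa", "Europe");
        System.out.println(tallyByRegion(regions).get(0));
    }
}
```

Taking element 0 of the sorted list plays the role of first in the test above: it picks the group with the highest tally.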
4.3. sort, limit, and out
Now let’s use sort, limit and out to extract the seven largest countries area-wise and write them into a new collection:
@Test
public void givenCountryCollection_whenAreaSortedDescending_thenSuccess() {
    collection.aggregate(Arrays.asList(
      sort(Sorts.descending("area")),
      limit(7),
      out("largest_seven"))).toCollection();

    MongoCollection<Document> largestSeven = database.getCollection("largest_seven");

    assertEquals(7, largestSeven.countDocuments());

    Document usa = largestSeven.find(Filters.eq("alpha3Code", "USA")).first();

    assertNotNull(usa);
}
Here, we first sort the collection in descending order of area. Then we use the Aggregates#limit method to restrict the result to seven documents. Finally, we use the out stage to write this data into a new collection called largest_seven. This collection can now be used like any other, for example, to check whether it contains USA.
4.4. project, group (with max), match
In our last sample, let’s try something trickier. Say we need to find out how many borders each country shares with others, and what is the maximum such number.
Now in our dataset, we have a borders field, which is an array listing alpha3Codes for all bordering countries of the nation, but there isn’t any field directly giving us the count. So we’ll need to derive the number of borderingCountries using project:
@Test
public void givenCountryCollection_whenNeighborsCalculated_thenMaxIsFifteenInChina() {
    Bson borderingCountriesCollection = project(Projections.fields(Projections.excludeId(),
      Projections.include("name"), Projections.computed("borderingCountries",
        Projections.computed("$size", "$borders"))));

    int maxValue = collection.aggregate(Arrays.asList(borderingCountriesCollection,
      group(null, Accumulators.max("max", "$borderingCountries"))))
      .first().getInteger("max");

    assertEquals(15, maxValue);

    Document maxNeighboredCountry = collection.aggregate(Arrays.asList(borderingCountriesCollection,
      match(Filters.eq("borderingCountries", maxValue)))).first();

    assertTrue(maxNeighboredCountry.containsValue("China"));
}
After that, as we saw before, we group the projected collection to find the max value of borderingCountries. Passing null as the group key puts all documents into a single group. One thing to point out here is that the max accumulator gives us the maximum value as a number, not the entire Document containing the maximum value. We need a separate match to select the desired Document if any further operations are to be performed.
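The two-step shape of this sample (compute a maximum, then look up who holds it) can be sketched in plain Java as follows. This is an analogy using java.util.stream rather than the driver; the class MaxThenMatchSketch, its methods, and the placeholder country codes are assumptions for illustration only.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative analogy only: a max accumulator yields a number, so a second pass finds its owner.
class MaxThenMatchSketch {

    // Counterpart of project ($size of borders) followed by group with max: returns just the number.
    static int maxBorderCount(Map<String, List<String>> bordersByCountry) {
        return bordersByCountry.values().stream()
            .mapToInt(List::size)
            .max()
            .orElse(0);
    }

    // Counterpart of the follow-up match: selects the countries whose border count equals the value.
    static List<String> countriesWithBorderCount(Map<String, List<String>> bordersByCountry, int count) {
        return bordersByCountry.entrySet().stream()
            .filter(entry -> entry.getValue().size() == count)
            .map(Map.Entry::getKey)
            .sorted()
            .collect(Collectors.toList());
    }

    // Hypothetical data with placeholder codes, not taken from the tutorial's dataset.
    static Map<String, List<String>> sample() {
        return Map.of(
            "AAA", List.of("BBB", "CCC"),
            "BBB", List.of("AAA"),
            "CCC", List.of("AAA"),
            "DDD", List.<String>of());
    }

    public static void main(String[] args) {
        int max = maxBorderCount(sample());
        System.out.println(max + " " + countriesWithBorderCount(sample(), max));
    }
}
```

Returning a list from the second step reflects that more than one country could share the maximum, whereas the test above simply takes the first match.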
5. Conclusion
In this article, we saw what MongoDB aggregations are and how to apply them in Java using an example dataset.
We used four samples to illustrate the various aggregation stages and form a basic understanding of the concept. The framework offers many more possibilities for data analytics, which can be explored further.
For further reading, Spring Data MongoDB provides an alternative way to handle projections and aggregations in Java.
As always, source code is available over on GitHub.