Event-Driven Data with Apache Druid

Last modified: June 17, 2020

1. Introduction

In this tutorial, we'll understand how to work with event data and Apache Druid. We'll cover the basics of event data and Druid architecture. As part of that, we'll create a simple data pipeline that leverages various features of Druid, covering different modes of data ingestion and different ways to query the prepared data.

2. Basic Concepts

Before we plunge into the operation details of Apache Druid, let’s first go through some of the basic concepts. The space we are interested in is real-time analytics of event data on a massive scale.

Hence, it's imperative to understand what we mean by event data and what it takes to analyze it in real-time at scale.

2.1. What Is Event Data?

Event data refers to a piece of information about a change that occurs at a specific point in time. Event data is almost ubiquitous in present-day applications. From the classical application logs to modern-day sensor data generated by things, it’s practically everywhere. These are often characterized by machine-readable information generated at a massive scale.

They power several functions like prediction, automation, communication, and integration, to name a few. Additionally, they are of significance in event-driven architecture.

2.2. What Is Apache Druid?

Apache Druid is a real-time analytics database designed for fast analytics over event-oriented data. Druid was started in 2011, open-sourced under the GPL license in 2012, and moved to Apache License in 2015. It’s managed by the Apache Foundation with community contributions from several organizations. It provides real-time ingestion, fast query performance, and high availability.

The name Druid refers to the fact that its architecture can shift to solve different types of data problems. It’s often employed in business intelligence applications to analyze a high volume of real-time and historical data.

3. Druid Architecture

Druid is a column-oriented and distributed data store written in Java. It's capable of ingesting massive amounts of event data and offering low-latency queries on top of this data. Moreover, it offers the possibility to slice and dice the data arbitrarily.

It’s quite interesting to understand how Druid architecture supports these features. In this section, we’ll go through some of the important parts of Druid architecture.

3.1. Data Storage Design

It's important to understand how Druid structures and stores its data, which allows for partitioning and distribution. By default, Druid partitions the data during processing and stores it in chunks and segments:

Druid stores data in what we know as “datasource”, which is logically similar to tables in relational databases. A Druid cluster can handle multiple data sources in parallel, ingested from various sources.

Each datasource is partitioned — based on time, by default, and further based on other attributes if so configured. A time range of data is known as a “chunk” — for instance, an hour’s data if data is partitioned by the hour.

Every chunk is further partitioned into one or more “segments”, which are single files comprising many rows of data. A datasource may have anywhere from a few segments to millions of segments.

3.2. Druid Processes

Druid has a multi-process and distributed architecture. Hence, each process can be scaled independently, allowing us to create flexible clusters. Let’s understand the important processes that are part of Druid:

  • Coordinator: This process is mainly responsible for segment management and distribution and communicates with historical processes to load or drop segments based on configurations
  • Overlord: This is the main process that is responsible for accepting tasks, coordinating task distribution, creating locks around tasks, and returning status to callers
  • Broker: This is the process to which all queries are sent to be executed in a distributed cluster; it collects metadata from Zookeeper and routes queries to processes having the right segments
  • Router: This is an optional process that can be used to route queries to different broker processes, thus providing query isolation to queries for more important data
  • Historical: These are the processes that store queryable data; they maintain a constant connection with Zookeeper and watch for segment information that they have to load and serve
  • MiddleManager: These are the worker processes that execute the submitted tasks; they forward the tasks to Peons running in separate JVMs, thus providing resource and log isolation

3.3. External Dependencies

Apart from the core processes, Druid relies on several external dependencies for its cluster to function as expected.

Let’s see how a Druid cluster is formed together with core processes and external dependencies:

Druid uses deep storage to store any data that has been ingested into the system. It's not used to respond to queries but serves as a backup of the data and as a way to transfer data between processes. It can be anything from a local filesystem to a distributed object store like S3 or HDFS.

The metadata storage is used to hold shared system metadata like segment usage information and task information. However, it’s never used to store the actual data. It’s a relational database like Apache Derby, PostgreSQL, or MySQL.

Druid uses Apache Zookeeper to manage the current cluster state. It facilitates a number of operations in a Druid cluster, like coordinator/overlord leader election, the segment publishing protocol, and the segment load/drop protocol.

4. Druid Setup

Druid is designed to be deployed as a scalable, fault-tolerant cluster. However, setting up a production-grade Druid cluster is not trivial. As we have seen earlier, there are many processes and external dependencies to set up and configure. As it’s possible to create a cluster in a flexible manner, we must pay attention to our requirements to set up individual processes appropriately.

Also, Druid is only supported in Unix-like environments and not on Windows. Moreover, Java 8 or later is required to run Druid processes. There are several single-server configurations available for setting up Druid on a single machine for running tutorials and examples. However, for running a production workload, it’s recommended to set up a full-fledged Druid cluster with multiple machines.

For the purpose of this tutorial, we’ll set up Druid on a single machine with the help of the official Docker image published on the Docker Hub. This enables us to run Druid on Windows as well, which, as we have discussed earlier, is not otherwise supported. There is a Docker compose file available, which creates a container for each Druid process and its external dependencies.

We have to provide configuration values to Druid as environment variables. The easiest way to achieve this is to provide a file called “environment” in the same directory as the Docker compose file.
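
As an illustration, a trimmed-down "environment" file might look like the sketch below. The memory settings and the service host names (zookeeper, postgres) are assumptions based on a typical Docker Compose setup; the full file that ships with the Druid distribution contains more entries, and the exact keys can vary between versions:

# JVM sizing for the Druid processes (illustrative values only)
DRUID_XMX=1g
DRUID_XMS=1g
# Druid runtime properties are passed to the image as druid_* variables
druid_zk_service_host=zookeeper
druid_metadata_storage_type=postgresql
druid_metadata_storage_connector_connectURI=jdbc:postgresql://postgres:5432/druid
druid_storage_type=local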

Once we have the Docker compose and the environment file in place, starting up Druid is as simple as running a command in the same directory:

docker-compose up

This will bring up all the containers required for a single-machine Druid setup. We have to be careful to provide enough memory to the Docker machine, as Druid consumes a significant amount of resources.

5. Ingesting Data

The first step towards building a data pipeline using Druid is to load data into Druid. This process is referred to as data ingestion or indexing in Druid architecture. We have to find a suitable dataset to proceed with this tutorial.

Now, as we've gathered so far, we should pick data that consists of events and has some temporal nature, to make the most of the Druid infrastructure.

The official guide for Druid uses simple and elegant data containing Wikipedia page edits for a specific date. We’ll continue to use that for our tutorial here.

5.1. Data Model

Let's begin by examining the structure of the data we have with us. Most data pipelines we create are quite sensitive to data anomalies, and hence, it's necessary to clean up the data as much as possible.

Although there are sophisticated ways and tools to perform data analysis, we'll begin with a visual inspection. A quick analysis reveals that the input data has events captured in JSON format, with a single event containing typical attributes:

{
  "time": "2015-09-12T02:10:26.679Z",
  "channel": "#pt.wikipedia",
  "cityName": null,
  "comment": "Houveram problemas na última edição e tive de refazê-las, junto com as atualizações da página.",
  "countryIsoCode": "BR",
  "countryName": "Brazil",
  "isAnonymous": true,
  "isMinor": false,
  "isNew": false,
  "isRobot": false,
  "isUnpatrolled": true,
  "metroCode": null,
  "namespace": "Main",
  "page": "Catarina Muniz",
  "regionIsoCode": null,
  "regionName": null,
  "user": "181.213.37.148",
  "delta": 197,
  "added": 197,
  "deleted": 0
}

While there are quite a number of attributes defining this event, there are a few that are of special interest to us when working with Druid:

  • Timestamp
  • Dimensions
  • Metrics

Druid requires us to identify a particular attribute as the timestamp column. In most situations, Druid's data parser is able to detect the best candidate automatically. But we always have the choice to select one ourselves, especially if we do not have a fitting attribute in our data.

Dimensions are the attributes that Druid stores as-is. We can use them for any purpose like grouping, filtering, or applying aggregators. We can choose which dimensions to include in the ingestion specification, which we'll discuss further in this tutorial.

Metrics are the attributes that, unlike dimensions, are stored in aggregated form by default. We can choose an aggregation function for Druid to apply to these attributes during ingestion. With roll-up enabled, these can lead to compact data representations.

5.2. Ingestion Methods

Now, we'll discuss the various ways we can perform data ingestion in Druid. Typically, event-driven data is streaming in nature, which means it keeps being generated at varying rates over time, like Wikipedia edits.

However, we may also have data batched over a period of time, where the data is more static in nature, like all the Wikipedia edits that happened last year.

We may also have diverse data use cases to solve, and Druid has fantastic support for most of them. Let’s go over two of the most common ways to use Druid in a data pipeline:

  • Streaming Ingestion
  • Batched Ingestion

The most common way to ingest data into Druid is through streaming ingestion, where Druid can read data directly from Apache Kafka. Druid supports other platforms like Kinesis as well. We have to start supervisors on the Overlord process, which create and manage Kafka indexing tasks. We can start a supervisor by submitting a supervisor spec as a JSON file over an HTTP POST command to the Overlord process.
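
As a rough illustration, a minimal supervisor spec for reading similar Wikipedia events from a hypothetical Kafka topic could look like the sketch below; the datasource name, topic name, and broker address are assumptions, and the exact fields can differ between Druid versions. We would submit it over HTTP POST to the Overlord's /druid/indexer/v1/supervisor endpoint:

{
  "type" : "kafka",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia-stream",
      "timestampSpec" : { "column" : "time", "format" : "iso" },
      "dimensionsSpec" : { "dimensions" : ["channel", "page", "user"] },
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "hour",
        "queryGranularity" : "none",
        "rollup" : false
      }
    },
    "ioConfig" : {
      "type" : "kafka",
      "topic" : "wikipedia-edits",
      "inputFormat" : { "type" : "json" },
      "consumerProperties" : { "bootstrap.servers" : "localhost:9092" }
    },
    "tuningConfig" : { "type" : "kafka" }
  }
}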

Alternatively, we can ingest data in batch — for example, from a local or remote file. Druid offers Hadoop-based batch ingestion for ingesting data from the Hadoop filesystem in the Hadoop file format. More commonly, we can choose native batch ingestion, either sequentially or in parallel. It's a more convenient and simpler approach, as it does not have any external dependencies.

5.3. Defining the Task Specification

For this tutorial, we'll set up a native batch ingestion task for the input data we have. We have the option of configuring the task from the Druid console, which gives us an intuitive graphical interface. Alternatively, we can define the task spec as a JSON file and submit it to the Overlord process using a script or the command line.

Let’s first define a simple task spec for ingesting our data in a file called wikipedia-index.json:

{
  "type" : "index_parallel",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia",
      "dimensionsSpec" : {
        "dimensions" : [
          "channel",
          "cityName",
          "comment",
          "countryIsoCode",
          "countryName",
          "isAnonymous",
          "isMinor",
          "isNew",
          "isRobot",
          "isUnpatrolled",
          "metroCode",
          "namespace",
          "page",
          "regionIsoCode",
          "regionName",
          "user",
          { "name": "added", "type": "long" },
          { "name": "deleted", "type": "long" },
          { "name": "delta", "type": "long" }
        ]
      },
      "timestampSpec": {
        "column": "time",
        "format": "iso"
      },
      "metricsSpec" : [],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2015-09-12/2015-09-13"],
        "rollup" : false
      }
    },
    "ioConfig" : {
      "type" : "index_parallel",
      "inputSource" : {
        "type" : "local",
        "baseDir" : "quickstart/tutorial/",
        "filter" : "wikiticker-2015-09-12-sampled.json.gz"
      },
      "inputFormat" : {
        "type": "json"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index_parallel",
      "maxRowsPerSegment" : 5000000,
      "maxRowsInMemory" : 25000
    }
  }
}

Let’s understand this task spec with respect to the basics we’ve gone through in previous sub-sections:

  • We have chosen the index_parallel task, which provides us native batch ingestion in parallel
  • The datasource we’ll be using in this task has the name “wikipedia”
  • The timestamp for our data is coming from the attribute “time”
  • There are a number of data attributes we are adding as dimensions
  • We’re not using any metrics for our data in the current task
  • Roll-up, which is enabled by default, should be disabled for this task
  • The input source for the task is a local file named wikiticker-2015-09-12-sampled.json.gz
  • We’re not using any secondary partition, which we can define in the tuningConfig

This task spec assumes that we’ve downloaded the data file wikiticker-2015-09-12-sampled.json.gz and kept it on the local machine where Druid is running. This may be trickier when we’re running Druid as a Docker container. Fortunately, Druid comes with this sample data present by default at the location quickstart/tutorial.

5.4. Submitting the Task Specification

Finally, we can submit this task spec to the Overlord process through the command line using a tool like curl:

curl -X 'POST' -H 'Content-Type:application/json' -d @wikipedia-index.json http://localhost:8081/druid/indexer/v1/task

Normally, the above command returns the ID of the task if the submission is successful. We can verify the state of our ingestion task through the Druid console or by performing queries, which we’ll go through in the next section.
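
For instance, a quick way to check on the task from the command line is a GET call against the same indexer API, substituting the returned identifier for <task-id>; the exact response fields may vary by Druid version:

curl http://localhost:8081/druid/indexer/v1/task/<task-id>/status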

5.5. Advanced Ingestion Concepts

Druid is best suited for cases where we have data at a massive scale to deal with — certainly not the kind of data we've seen in this tutorial! Now, to enable features at scale, the Druid architecture must provide suitable tools and tricks.

While we’ll not use them in this tutorial, let’s quickly discuss roll-up and partitioning.

Event data can soon grow to massive volumes, which can affect the query performance we can achieve. In many scenarios, it may be possible for us to summarise data over time. This is what we know as roll-up in Druid. When roll-up is enabled, Druid makes an effort to roll up rows with identical dimensions and timestamps during ingestion. While it can save space, roll-up does lead to a loss in query precision; hence, we must use it judiciously.
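
For instance, if we wanted roll-up for our Wikipedia data, we could enable it in the granularitySpec and aggregate the numeric columns in the metricsSpec instead of keeping them as dimensions. The fragment below is only a sketch of the relevant parts of a task spec, not a complete one:

"metricsSpec" : [
  { "type" : "count", "name" : "count" },
  { "type" : "longSum", "name" : "added", "fieldName" : "added" },
  { "type" : "longSum", "name" : "deleted", "fieldName" : "deleted" }
],
"granularitySpec" : {
  "type" : "uniform",
  "segmentGranularity" : "day",
  "queryGranularity" : "hour",
  "rollup" : true
}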

Another potential way to achieve better performance in the face of rising data volumes is distributing the data and, hence, the workload. By default, Druid partitions the data based on timestamps into time chunks containing one or more segments. Further, we can decide to do secondary partitioning using natural dimensions to improve data locality. Moreover, Druid sorts data within every segment by timestamp first and then by the other dimensions that we configure.
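
As a sketch, secondary partitioning on a natural dimension can be configured through the partitionsSpec in the tuningConfig of our batch task. The choice of the channel dimension here is only an illustration, and depending on the Druid version, additional options (such as enabling perfect roll-up) may be required:

"tuningConfig" : {
  "type" : "index_parallel",
  "partitionsSpec" : {
    "type" : "hashed",
    "partitionDimensions" : ["channel"]
  }
}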

6. Querying Data

Once we have successfully performed the data ingestion, it should be ready for us to query. There are multiple ways to query data in Druid. The simplest way to execute a query in Druid is through the Druid console. However, we can also execute queries by sending HTTP commands or using a command-line tool.

The two prominent ways to construct queries in Druid are native queries and SQL-like queries. We’re going to construct some basic queries in both these ways and send them over HTTP using curl. Let’s find out how we can create some simple queries on the data we have ingested earlier in Druid.

6.1. Native Queries

Native queries in Druid use JSON objects, which we can send to a broker or a router for processing. We can send the queries over the HTTP POST command, amongst other ways, to do the same.

Let’s create a JSON file by the name simple_query_native.json:

{
  "queryType" : "topN",
  "dataSource" : "wikipedia",
  "intervals" : ["2015-09-12/2015-09-13"],
  "granularity" : "all",
  "dimension" : "page",
  "metric" : "count",
  "threshold" : 10,
  "aggregations" : [
    {
      "type" : "count",
      "name" : "count"
    }
  ]
}

This is a simple query that fetches the top ten pages that received the most page edits between the 12th and 13th of September, 2015.

Let’s post this over HTTP using curl:

curl -X 'POST' -H 'Content-Type:application/json' -d @simple_query_native.json http://localhost:8888/druid/v2?pretty

This response contains the details of the top ten pages in JSON format:

[ {
  "timestamp" : "2015-09-12T00:46:58.771Z",
  "result" : [ {
    "count" : 33,
    "page" : "Wikipedia:Vandalismusmeldung"
  }, {
    "count" : 28,
    "page" : "User:Cyde/List of candidates for speedy deletion/Subpage"
  }, {
    "count" : 27,
    "page" : "Jeremy Corbyn"
  }, {
    "count" : 21,
    "page" : "Wikipedia:Administrators' noticeboard/Incidents"
  }, {
    "count" : 20,
    "page" : "Flavia Pennetta"
  }, {
    "count" : 18,
    "page" : "Total Drama Presents: The Ridonculous Race"
  }, {
    "count" : 18,
    "page" : "User talk:Dudeperson176123"
  }, {
    "count" : 18,
    "page" : "Wikipédia:Le Bistro/12 septembre 2015"
  }, {
    "count" : 17,
    "page" : "Wikipedia:In the news/Candidates"
  }, {
    "count" : 17,
    "page" : "Wikipedia:Requests for page protection"
  } ]
} ]

6.2. Druid SQL

Druid has a built-in SQL layer, which offers us the liberty to construct queries in familiar SQL-like constructs. It leverages Apache Calcite to parse and plan the queries. However, Druid SQL converts the SQL queries to native queries on the query broker before sending them to data processes.

Let’s see how we can create the same query as before, but using Druid SQL. As before, we’ll create a JSON file by the name simple_query_sql.json:

{
  "query":"SELECT page, COUNT(*) AS counts
    FROM wikipedia WHERE \"__time\"
    BETWEEN TIMESTAMP '2015-09-12 00:00:00' AND TIMESTAMP '2015-09-13 00:00:00'
    GROUP BY page ORDER BY counts DESC LIMIT 10"
}

Please note that the query has been broken into multiple lines for readability, but it should appear on a single line. Again, as before, we’ll POST this query over HTTP, but to a different endpoint:

curl -X 'POST' -H 'Content-Type:application/json' -d @simple_query_sql.json http://localhost:8888/druid/v2/sql

The output should be very similar to what we achieved earlier with the native query.

6.3. Query Types

We saw, in the earlier section, a type of query where we fetched the top ten results for the metric “count” based on an interval. This is just one type of query that Druid supports, and it's known as the TopN query. Of course, we can make this simple TopN query much more interesting by using filters and aggregations. But that is beyond the scope of this tutorial. However, there are several other queries in Druid that may interest us.

Some of the popular ones include Timeseries and GroupBy.

Timeseries queries return an array of JSON objects, where each object represents a value as described in the time-series query — for instance, the daily average of a dimension for the last month.
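
For instance, a minimal timeseries query against the datasource we ingested earlier might look like the sketch below, summing the added column per hour; the aggregator name totalAdded is our own choice:

{
  "queryType" : "timeseries",
  "dataSource" : "wikipedia",
  "intervals" : ["2015-09-12/2015-09-13"],
  "granularity" : "hour",
  "aggregations" : [
    { "type" : "longSum", "name" : "totalAdded", "fieldName" : "added" }
  ]
}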

GroupBy queries return an array of JSON objects, where each object represents a grouping as described in the group-by query. For example, we can query for the daily average of a dimension for the past month grouped by another dimension.

There are several other query types, including Scan, Search, TimeBoundary, SegmentMetadata, and DatasourceMetadata.

6.4. Advanced Query Concepts

Druid offers some sophisticated methods to create complex queries for building interesting data applications. These include various ways to slice and dice the data while still being able to provide incredible query performance.

While a detailed discussion of them is beyond the scope of this tutorial, let’s discuss some of the important ones like Joins and Lookups, Multitenancy, and Query Caching.

Druid supports two ways of joining the data. The first is the join operators, and the second is query-time lookups. However, for better query performance, it’s advisable to avoid query-time joins.

Multitenancy refers to the feature of supporting multiple tenants on the same Druid infrastructure while still offering them logical isolation. It’s possible to achieve this in Druid through separate data sources per tenant or data partitioning by the tenant.

And finally, query caching is the key to performance in data-intensive applications. Druid supports query result caching at the segment and the query result levels. Further, the cache data can reside in memory or in external persistent storage.
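
As a rough sketch, caching on the Broker can be switched on through runtime properties such as the ones below; these property names follow Druid's caching configuration, but the defaults and recommended settings vary by version and deployment size:

druid.broker.cache.useCache=true
druid.broker.cache.populateCache=true
druid.cache.type=caffeine
druid.cache.sizeInBytes=268435456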

7. Language Bindings

Although Druid has excellent support for creating ingestion specs and defining queries in JSON, it can sometimes be tedious to define these queries in JSON, especially when queries get complex. Unfortunately, Druid doesn't offer a client library in any specific language to help us in this regard. But there are quite a few language bindings that have been developed by the community. One such client library is also available for Java.

We’ll quickly see how we can build the TopN query we used earlier using this client library in Java.

Let’s begin by defining the required dependency in Maven:

<dependency>
    <groupId>in.zapr.druid</groupId>
    <artifactId>druidry</artifactId>
    <version>2.14</version>
</dependency>

After this, we should be able to use the client library and create our TopN query:

// Query interval matching the day of data we ingested earlier
DateTime startTime = new DateTime(2015, 9, 12, 0, 0, 0, DateTimeZone.UTC);
DateTime endTime = new DateTime(2015, 9, 13, 0, 0, 0, DateTimeZone.UTC);
Interval interval = new Interval(startTime, endTime);
// Equivalent of "granularity" : "all" in the native query
Granularity granularity = new SimpleGranularity(PredefinedGranularity.ALL);
DruidDimension dimension = new SimpleDimension("page");
TopNMetric metric = new SimpleMetric("count");
DruidTopNQuery query = DruidTopNQuery.builder()
  .dataSource("wikipedia")
  .dimension(dimension)
  .threshold(10)
  .topNMetric(metric)
  .granularity(granularity)
  .aggregators(Arrays.asList(new LongSumAggregator("count", "count")))
  .intervals(Collections.singletonList(interval)).build();

After this, we can simply generate the required JSON structure, which we can use in the HTTP POST call:

ObjectMapper mapper = new ObjectMapper();
String requiredJson = mapper.writeValueAsString(query);

8. Conclusion

In this tutorial, we went through the basics of event data and Apache Druid architecture.

Further, we set up a basic Druid cluster using Docker containers on our local machine. Then, we also went through the process of ingesting a sample dataset in Druid using a native batch task. After this, we saw the different ways we have to query our data in Druid. Lastly, we went through a client library in Java to construct Druid queries.

We have just scratched the surface of the features that Druid has to offer. There are several ways in which Druid can help us build our data pipeline and create data applications. The advanced ingestion and querying features are the obvious next steps to learn in order to effectively leverage the power of Druid.

Moreover, creating a suitable Druid cluster that scales the individual processes as needed should be the goal in order to maximize the benefits.