1. Overview
Data serialization is a technique of converting data into binary or text format. There are multiple systems available for this purpose. Apache Avro is one of those data serialization systems.
Avro is a language-independent, schema-based data serialization library. It uses a schema to perform serialization and deserialization. Moreover, Avro uses a JSON format to specify the data structure, which makes it more powerful.
In this tutorial, we’ll explore more about Avro setup, the Java API to perform serialization and a comparison of Avro with other data serialization systems.
We’ll focus primarily on schema creation which is the base of the whole system.
2. Apache Avro
Avro is a language-independent serialization library. To achieve this, Avro uses a schema, which is one of its core components. It stores the schema in a file for further data processing.
Avro is a good fit for Big Data processing. It's quite popular in the Hadoop and Kafka world for its fast processing.
Avro creates a data file where it keeps data along with schema in its metadata section. Above all, it provides a rich data structure which makes it more popular than other similar solutions.
To use Avro for serialization, we need to follow the steps mentioned below.
3. Problem Statement
Let's start by defining a class called AvroHttpRequest that we'll use for our examples. The class contains primitive as well as complex type attributes:
class AvroHttpRequest {
    private long requestTime;
    private ClientIdentifier clientIdentifier;
    private List<String> employeeNames;
    private Active active;
}
Here, requestTime is a primitive value. ClientIdentifier is another class which represents a complex type. We also have employeeNames, which is again a complex type. Active is an enum that describes whether the given list of employees is active or not.
Our objective is to serialize and de-serialize the AvroHttpRequest class using Apache Avro.
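For context, the supporting types referenced above might look like the following. This is a minimal sketch; the constructor and getters are assumptions for illustration, while the field names mirror the hostName and ipAddress fields of the schema we'll see shortly:

```java
// Hypothetical sketch of the supporting types used by AvroHttpRequest.
// The constructor and getters are assumptions for illustration.
class ClientIdentifier {
    private String hostName;
    private String ipAddress;

    ClientIdentifier(String hostName, String ipAddress) {
        this.hostName = hostName;
        this.ipAddress = ipAddress;
    }

    String getHostName() { return hostName; }
    String getIpAddress() { return ipAddress; }
}

// Mirrors the YES/NO symbols we'll declare in the Avro schema.
enum Active {
    YES, NO
}
```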
4. Avro Data Types
Before proceeding further, let’s discuss the data types supported by Avro.
Avro supports two types of data:
- Primitive types: Avro supports all the primitive types. We use the primitive type name to define the type of a given field. For example, a value that holds a String should be declared as {"type": "string"} in the schema
- Complex types: Avro supports six kinds of complex types: records, enums, arrays, maps, unions and fixed
For example, in our problem statement, ClientIdentifier is a record.
In that case, the schema for ClientIdentifier should look like:
{
    "type": "record",
    "name": "ClientIdentifier",
    "namespace": "com.baeldung.avro",
    "fields": [
        {
            "name": "hostName",
            "type": "string"
        },
        {
            "name": "ipAddress",
            "type": "string"
        }
    ]
}
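For comparison, an enum such as our Active type would be declared like this (a sketch based on the Avro schema rules above; an array of strings would similarly be declared as {"type": "array", "items": "string"}):

```json
{
    "type": "enum",
    "name": "Active",
    "symbols": ["YES", "NO"]
}
```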
5. Using Avro
To start with, let’s add the Maven dependencies we’ll need to our pom.xml file.
We should include the following dependencies:
- Apache Avro – core components
- Compiler – Apache Avro compilers for Avro IDL and Avro Specific Java API
- Tools – which includes Apache Avro command line tools and utilities
- Apache Avro Maven Plugin for Maven projects
We’re using version 1.8.2 for this tutorial.
However, it’s always advised to find the latest version on Maven Central:
<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-compiler</artifactId>
    <version>1.8.2</version>
</dependency>
<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-maven-plugin</artifactId>
    <version>1.8.2</version>
</dependency>
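The snippet above covers the compiler and the Maven plugin; the core components come from the avro artifact itself, pulled in the same way (same version):

```xml
<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.8.2</version>
</dependency>
```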
After adding the Maven dependencies, the next steps will be:
- Schema creation
- Reading the schema in our program
- Serializing our data using Avro
- Finally, de-serializing the data
6. Schema Creation
Avro describes its Schema using a JSON format. There are mainly four attributes for a given Avro Schema:
- Type – describes the type of the schema, whether it's a complex type or a primitive value
- Namespace – describes the namespace the given schema belongs to
- Name – the name of the schema
- Fields – tells us about the fields associated with a given schema. Fields can be of primitive as well as complex type.
One way of creating the schema is to write the JSON representation, as we saw in the previous sections.
We can also create a schema using SchemaBuilder, which is undeniably a better and more efficient way to create it.
6.1. SchemaBuilder Utility
The class org.apache.avro.SchemaBuilder is useful for creating the Schema.
First of all, let’s create the schema for ClientIdentifier:
Schema clientIdentifier = SchemaBuilder.record("ClientIdentifier")
  .namespace("com.baeldung.avro")
  .fields().requiredString("hostName").requiredString("ipAddress")
  .endRecord();
Now, let’s use this for creating an avroHttpRequest schema:
Schema avroHttpRequest = SchemaBuilder.record("AvroHttpRequest")
  .namespace("com.baeldung.avro")
  .fields().requiredLong("requestTime")
  .name("clientIdentifier")
    .type(clientIdentifier)
    .noDefault()
  .name("employeeNames")
    .type()
    .array()
    .items()
    .stringType()
    .arrayDefault(null)
  .name("active")
    .type()
    .enumeration("Active")
    .symbols("YES", "NO")
    .noDefault()
  .endRecord();
It's important to note here that we've assigned clientIdentifier as the type for the clientIdentifier field. In this case, the clientIdentifier used to define the type is the same schema we created before.
Later, we can apply the toString method to get the JSON structure of the Schema.
Schema files are saved using the .avsc extension. Let's save our generated schema to the "src/main/resources/avroHttpRequest-schema.avsc" file.
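Writing the schema out can be done with plain java.nio. This is a sketch under the assumption that schemaJson holds the string returned by the schema's toString method; SchemaWriter and writeSchema are hypothetical names introduced here for illustration:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class SchemaWriter {

    // Writes the JSON schema text to the given .avsc file.
    // In our example, schemaJson would be avroHttpRequest.toString().
    static void writeSchema(String schemaJson, String fileName) throws IOException {
        Path target = Paths.get(fileName);
        // Create parent directories such as src/main/resources if missing.
        if (target.getParent() != null) {
            Files.createDirectories(target.getParent());
        }
        Files.write(target, schemaJson.getBytes(StandardCharsets.UTF_8));
    }
}
```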
7. Reading the Schema
Reading a schema is more or less about creating Avro classes for the given schema. Once Avro classes are created we can use them to serialize and deserialize objects.
There are two ways to create Avro classes:
- Programmatically generating Avro classes: classes can be generated using SchemaCompiler. There are a couple of APIs we can use for generating Java classes. We can find the code for the generation classes on GitHub.
- Using Maven to generate classes
We have a Maven plugin that does the job well. We need to include the plugin and run mvn clean install.
Let’s add the plugin to our pom.xml file:
<plugin>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-maven-plugin</artifactId>
    <version>${avro.version}</version>
    <executions>
        <execution>
            <id>schemas</id>
            <phase>generate-sources</phase>
            <goals>
                <goal>schema</goal>
                <goal>protocol</goal>
                <goal>idl-protocol</goal>
            </goals>
            <configuration>
                <sourceDirectory>${project.basedir}/src/main/resources/</sourceDirectory>
                <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
            </configuration>
        </execution>
    </executions>
</plugin>
8. Serialization and Deserialization with Avro
Now that we're done generating the schema, let's continue exploring the serialization part.
Avro supports two data serialization formats: JSON format and binary format.
First, we’ll focus on the JSON format and then we’ll discuss the Binary format.
Before proceeding further, we should go through a few key interfaces. We can use the interfaces and classes below for serialization:
DatumWriter<T>: We should use this to write data against a given schema. We'll be using the SpecificDatumWriter implementation in our example; however, DatumWriter has other implementations as well: GenericDatumWriter, Json.Writer, ProtobufDatumWriter, ReflectDatumWriter and ThriftDatumWriter.
Encoder: The Encoder is used for defining the format, as previously mentioned. EncoderFactory provides two types of encoders: a binary encoder and a JSON encoder.
DatumReader<D>: A single interface for de-serialization. Again, it has multiple implementations, but we'll be using SpecificDatumReader in our example. Other implementations are GenericDatumReader, Json.ObjectReader, Json.Reader, ProtobufDatumReader, ReflectDatumReader and ThriftDatumReader.
Decoder: The Decoder is used while de-serializing the data. DecoderFactory provides two types of decoders: a binary decoder and a JSON decoder.
Next, let’s see how serialization and de-serialization happen in Avro.
8.1. Serialization
We’ll take the example of AvroHttpRequest class and try to serialize it using Avro.
First of all, let’s serialize it in JSON format:
public byte[] serealizeAvroHttpRequestJSON(AvroHttpRequest request) {
    DatumWriter<AvroHttpRequest> writer
      = new SpecificDatumWriter<>(AvroHttpRequest.class);
    byte[] data = new byte[0];
    ByteArrayOutputStream stream = new ByteArrayOutputStream();
    Encoder jsonEncoder = null;
    try {
        jsonEncoder = EncoderFactory.get()
          .jsonEncoder(AvroHttpRequest.getClassSchema(), stream);
        writer.write(request, jsonEncoder);
        jsonEncoder.flush();
        data = stream.toByteArray();
    } catch (IOException e) {
        logger.error("Serialization error:" + e.getMessage());
    }
    return data;
}
Let’s have a look at a test case for this method:
@Test
public void whenSerialized_UsingJSONEncoder_ObjectGetsSerialized() {
    byte[] data = serealizer.serealizeAvroHttpRequestJSON(request);
    assertTrue(Objects.nonNull(data));
    assertTrue(data.length > 0);
}
Here we've used the jsonEncoder method, passing the schema to it.
If we want to use a binary encoder, we need to replace the jsonEncoder() method with binaryEncoder():
Encoder jsonEncoder = EncoderFactory.get().binaryEncoder(stream,null);
8.2. Deserialization
To do this, we’ll be using the above-mentioned DatumReader and Decoder interfaces.
Just as we used EncoderFactory to get an Encoder, we'll use DecoderFactory to get a Decoder object.
Let’s de-serialize the data using JSON format:
public AvroHttpRequest deSerealizeAvroHttpRequestJSON(byte[] data) {
    DatumReader<AvroHttpRequest> reader
      = new SpecificDatumReader<>(AvroHttpRequest.class);
    Decoder decoder = null;
    try {
        decoder = DecoderFactory.get()
          .jsonDecoder(AvroHttpRequest.getClassSchema(), new String(data));
        return reader.read(null, decoder);
    } catch (IOException e) {
        logger.error("Deserialization error:" + e.getMessage());
        return null;
    }
}
And let’s see the test case:
@Test
public void whenDeserializeUsingJSONDecoder_thenActualAndExpectedObjectsAreEqual() {
    byte[] data = serealizer.serealizeAvroHttpRequestJSON(request);
    AvroHttpRequest actualRequest = deSerealizer
      .deSerealizeAvroHttpRequestJSON(data);
    assertEquals(actualRequest, request);
    assertTrue(actualRequest.getRequestTime()
      .equals(request.getRequestTime()));
}
Similarly, we can use a binary decoder:
Decoder decoder = DecoderFactory.get().binaryDecoder(data, null);
9. Conclusion
Apache Avro is especially useful when dealing with big data. It offers data serialization in binary as well as JSON format, which can be chosen based on the use case.
The Avro serialization process is faster, and it's space-efficient as well. Avro doesn't keep the field type information with each field; instead, it keeps this metadata in the schema.
Last but not least, Avro has great bindings with a wide range of programming languages, which gives it an edge.
As always, the code can be found over on GitHub.