Efficient Word Frequency Calculator in Java – Java中的高效词频计算器

最后修改: 2017年 12月 19日

中文/混合/英文(键盘快捷键:t)

1. Overview

1.概述

In this tutorial, we’ll show various ways of implementing a word counter in Java.

在本教程中,我们将展示在Java中实现单词计数器的各种方法。

2. Counter Implementations

2.计数器的实施

Let’s start by simply calculating the word count of words in this array:

让我们先简单计算一下这个数组中的字数。

static String[] COUNTRY_NAMES 
  = { "China", "Australia", "India", "USA", "USSR", "UK", "China", 
  "France", "Poland", "Austria", "India", "USA", "Egypt", "China" };

If we want to process huge files, we need to go for other options described here.

如果我们想处理巨大的文件,我们需要去选择其他的选项,在这里描述。

2.1. Map With Integers

2.1.MapIntegers

One of the simplest solutions would be to create a Map, store words as keys and the number of occurrences as values:

最简单的解决方案之一是创建一个Map,将单词作为键存储,将出现的次数作为值存储。

Map<String, Integer> counterMap = new HashMap<>();

for (String country : COUNTRY_NAMES) { 
    counterMap.compute(country, (k, v) -> v == null ? 1 : v + 1); 
}

assertEquals(3, counterMap.get("China").intValue());
assertEquals(2, counterMap.get("India").intValue());

We simply used Map‘s handy compute method which increments the counter or initializes it with 1 if the key isn’t present.

我们简单地使用了Map的方便的compute方法,该方法增加了计数器,或者在键不存在的情况下将其初始化为1。

However, this method of creating counter isn’t efficient as Integer is immutable, so every time when we increment the counter, we create a new Integer object.

然而,这种创建计数器的方法并不高效,因为Integer是不可变的,所以每次当我们增加计数器时,我们都会创建一个新的Integer对象。

2.2. Stream API

2.2.流API

Now, let’s leverage Java 8 Stream API, parallel Streams, and the groupingBy() collector:

现在,让我们利用Java 8 Stream API、并行StreamsgroupingBy()采集器。

@Test
public void whenMapWithLambdaAndWrapperCounter_runsSuccessfully() {
    Map<String, Long> counterMap = new HashMap<>();
 
    Stream.of(COUNTRY_NAMES)
      .collect(Collectors.groupingBy(k -> k, ()-> counterMap,
	    Collectors.counting());

    assertEquals(3, counterMap.get("China").intValue());
    assertEquals(2, counterMap.get("India").intValue());
}

Similarly, we could use a parallelStream:

同样,我们可以使用一个parallelStream

@Test
public void whenMapWithLambdaAndWrapperCounter_runsSuccessfully() {
    Map<String, Long> counterMap = new HashMap<>();
 
    Stream.of(COUNTRY_NAMES).parallel()
      .collect(Collectors.groupingBy(k -> k, ()-> counterMap,
	    Collectors.counting());

    assertEquals(3, counterMap.get("China").intValue());
    assertEquals(2, counterMap.get("India").intValue());
}

2.3. Map With an Integer Array

2.3.MapInteger阵列

Next, let’s use a Map that wraps a counter within an Integer array used as a value:

接下来,让我们使用一个Map,将一个计数器包裹在一个用作值的Integer数组内。

@Test
public void whenMapWithPrimitiveArrayCounter_runsSuccessfully() {
    Map<String, int[]> counterMap = new HashMap<>();

    counterWithPrimitiveArray(counterMap);

    assertEquals(3, counterMap.get("China")[0]);
    assertEquals(2, counterMap.get("India")[0]);
}
 
private void counterWithPrimitiveArray(Map<String, int[]> counterMap) {
    for (String country : COUNTRY_NAMES) {
        counterMap.compute(country, (k, v) -> v == null ? 
          new int[] { 0 } : v)[0]++;
    }
}

Note how we created a simple HashMap with int arrays as values.

注意我们是如何创建一个简单的HashMap,以int数组作为值。

In the counterWithPrimitiveArray method, while iterating over each value of the array, we:

counterWithPrimitiveArray方法中,在迭代数组的每个值时,我们。

  • invoke a get on the counterMap by passing the country name as a key
  • check whether a key was already present or not. If the entry is already present, we create a new instance of primitive integer array with a single “1”. If the entry is absent, we increment the counter value present in the array

This method is better than the wrapper implementation – as it creates fewer objects.

这种方法比封装器的实现更好–因为它创建的对象更少。

2.4. Map With a MutableInteger

2.4.MapMutableInteger

Next, let’s create a wrapper object which embeds a primitive integer counter as below:

接下来,让我们创建一个封装对象,嵌入一个原始的整数计数器,如下所示。

private static class MutableInteger {
    int count = 1;
	
    public void increment() {
        this.count++;
    }
	
    // getter and setter
}

Let’s see how we can make use of above class as a counter:

让我们看看我们如何利用上述类作为计数器。

@Test
public void whenMapWithMutableIntegerCounter_runsSuccessfully() {
    Map<String, MutableInteger> counterMap = new HashMap<>();

    mapWithMutableInteger(counterMap);

    assertEquals(3, counterMap.get("China").getCount());
    assertEquals(2, counterMap.get("India").getCount());
}
private void counterWithMutableInteger(
  Map<String, MutableInteger> counterMap) {
    for (String country : COUNTRY_NAMES) {
        counterMap.compute(country, (k, v) -> v == null 
          ? new MutableInteger(0) : v).increment();
    }
}

In the mapWithMutableInteger method, while iterating over each country in the COUNTRY_NAMES array, we:

mapWithMutableInteger方法中,在迭代COUNTRY_NAMES数组中的每个国家时,我们。

  • invoke a get on the counterMap by passing the country name as a key
  • check whether the key is already present or not. If an entry is absent, we create an instance of MutableInteger which sets the counter value as 1. We increment the counter value present in the MutableInteger if the country is present in the map

This method of creating a counter is better than the previous one – as we’re reusing the same MutableInteger and thereby creating fewer objects.

这种创建计数器的方法比之前的方法更好–因为我们重复使用同一个MutableInteger,因此创建的对象更少。

This is how Apache Collections HashMultiSet works where it embeds a HashMap with value as MutableInteger internally.

这就是Apache Collections HashMultiSet的工作原理,它在内部嵌入一个HashMap,其值为MutableInteger

3. Performance Analysis

3.性能分析

Here’s the chart that compares the performance of each and every method listed above.
Word Frequency Counter

这是比较上述每一种方法性能的图表。
词频计数器

Above chart is created by using JMH and here’s the code that created the statistics above:

以上图表是通过使用JMH创建的,以下是创建上述统计数据的代码。

Map<String, Integer> counterMap = new HashMap<>();
Map<String, MutableInteger> counterMutableIntMap = new HashMap<>();
Map<String, int[]> counterWithIntArrayMap = new HashMap<>();
Map<String, Long> counterWithLongWrapperMap = new HashMap<>();
 
@Benchmark
public void wrapperAsCounter() {
    counterWithWrapperObject(counterMap);
}

@Benchmark
public void lambdaExpressionWithWrapper() {
    counterWithLambdaAndWrapper(counterWithLongWrapperMap );
}

@Benchmark
public void parallelStreamWithWrapper() {
    counterWithParallelStreamAndWrapper(counterWithLongWrapperStreamMap);
}
    
@Benchmark
public void mutableIntegerAsCounter() {
    counterWithMutableInteger(counterMutableIntMap);
}
    
@Benchmark
public void mapWithPrimitiveArray() {
   counterWithPrimitiveArray(counterWithIntArrayMap);
}

4. Conclusion

4.结论

In this quick article, we illustrated various ways of creating word counters using Java.

在这篇快速文章中,我们说明了使用Java创建单词计数器的各种方法。

The implementation of these examples can be found in the GitHub project – this is a Maven-based project, so it should be easy to import and run as is.

这些例子的实现可以在GitHub项目中找到–这是一个基于Maven的项目,所以应该很容易导入并按原样运行。