Histograms with Apache Commons Frequency

Last modified: June 12, 2018


1. Overview

In this tutorial, we'll look at how to present data on a histogram with the help of the Apache Commons Frequency class.

The Frequency class is part of the Apache Commons Math library.

A histogram is a diagram of connected bars that shows how often values in each range occur in a dataset. It differs from a bar chart in that it displays the distribution of continuous, quantitative variables, while a bar chart displays categorical data.

2. Project Dependencies

In this article, we’ll be using a Maven project with the following dependencies:

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-math3</artifactId>
    <version>3.6.1</version>
</dependency>
<dependency>
    <groupId>org.knowm.xchart</groupId>
    <artifactId>xchart</artifactId>
    <version>3.5.2</version>
</dependency>

The commons-math3 library contains the Frequency class that we’ll be using to determine the occurrence of variables in our dataset. The xchart library is what we’ll use to display the histogram in a GUI.

The latest versions of commons-math3 and xchart can be found on Maven Central.

3. Calculating the Frequency of Variables

For this tutorial, we'll be using a dataset comprising the ages of the students in a particular school. We'd like to see the frequency of different age groups and observe their distribution on a histogram chart.

Let’s represent the dataset with a List collection and use it to populate an instance of the Frequency class:

// Ages of the students in the school
List<Integer> datasetList = Arrays.asList(
  36, 25, 38, 46, 55, 68, 
  72, 55, 36, 38, 67, 45, 22, 
  48, 91, 46, 52, 61, 58, 55);
Frequency frequency = new Frequency();
// Store each age as a double so that later lookups use the same type
datasetList.forEach(d -> frequency.addValue(d.doubleValue()));

Now that we've populated our instance of the Frequency class, we're going to get the count of each age, work out which bin it falls into, and sum the counts per bin so we get the total frequency of each age group:

int classWidth = 10; // width of each age group

datasetList.stream()
  .map(Integer::doubleValue)
  .distinct() // process each distinct age only once
  .forEach(observation -> {
      // Number of times this age occurs in the dataset
      long observationFrequency = frequency.getCount(observation);
      // Round the age up to the next multiple of classWidth
      int upperBoundary = (observation > classWidth)
        ? Math.multiplyExact( (int) Math.ceil(observation / classWidth), classWidth)
        : classWidth;
      int lowerBoundary = (upperBoundary > classWidth)
        ? Math.subtractExact(upperBoundary, classWidth)
        : 0;
      String bin = lowerBoundary + "-" + upperBoundary;

      // Add this age's count to the total for its bin
      updateDistributionMap(lowerBoundary, bin, observationFrequency);
  });

In the snippet above, we first determine the frequency of the observation using the getCount() method of the Frequency class. The method returns the total number of occurrences of the observation.

Using the current observation, we dynamically determine the group it belongs to by figuring out its upper and lower boundaries relative to the class width, which is 10.
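
To make the boundary arithmetic concrete, here is a minimal sketch that lifts the two expressions from the snippet into standalone methods and prints the bin for a few sample ages. The class name BinBoundaries and the helpers upperBoundary(), lowerBoundary() and binFor() are hypothetical names chosen for illustration; only the expressions themselves come from the tutorial.

```java
import java.util.Arrays;
import java.util.List;

class BinBoundaries {

    // Class width taken from the text: every bin spans 10 years.
    static final int CLASS_WIDTH = 10;

    // Upper boundary: the observation rounded up to the next multiple of the class width.
    static int upperBoundary(double observation) {
        return (observation > CLASS_WIDTH)
          ? Math.multiplyExact((int) Math.ceil(observation / CLASS_WIDTH), CLASS_WIDTH)
          : CLASS_WIDTH;
    }

    // Lower boundary: one class width below the upper boundary, never below 0.
    static int lowerBoundary(int upperBoundary) {
        return (upperBoundary > CLASS_WIDTH)
          ? Math.subtractExact(upperBoundary, CLASS_WIDTH)
          : 0;
    }

    // Bin label in the same "lower-upper" form as the tutorial's snippet.
    static String binFor(double observation) {
        int upper = upperBoundary(observation);
        return lowerBoundary(upper) + "-" + upper;
    }

    public static void main(String[] args) {
        // Sample ages chosen for illustration, including one that sits on a boundary.
        List<Double> samples = Arrays.asList(8.0, 22.0, 55.0, 60.0, 91.0);
        samples.forEach(age -> System.out.println(age + " -> " + binFor(age)));
    }
}
```

Because the upper boundary is computed with Math.ceil(), an age that sits exactly on a boundary, such as 60, lands in the lower bin (50-60), so each bin includes its upper boundary.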

The upper and lower boundaries are concatenated to form a bin label, which is stored alongside the observationFrequency in a distributionMap using the updateDistributionMap() method.

If the bin already exists, we update its frequency; otherwise, we add it as a key and set the frequency of the current observation as its value. Note that the distinct() call on the stream ensures each observation is processed only once, which avoids counting duplicates twice.
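
The body of updateDistributionMap() isn't shown in this article, so the following is only a sketch of how such a method might look, not the author's implementation. It assumes that distributionMap is a Map<String, Long> ordered by the numeric lower boundary of each label, and it further assumes that the lowerBoundary argument is used to register the preceding bin with a zero count so that empty groups stay visible. Collections.frequency() stands in for Frequency.getCount() to keep the example free of dependencies, and the build() helper is hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DistributionMapSketch {

    static final int CLASS_WIDTH = 10;

    // Assumption: bin label -> total frequency, ordered by the numeric lower boundary of the label.
    static final Map<String, Long> distributionMap = new TreeMap<>(
      Comparator.comparingInt((String bin) -> Integer.parseInt(bin.substring(0, bin.indexOf('-')))));

    static void updateDistributionMap(int lowerBoundary, String bin, long observationFrequency) {
        // Assumption: register the preceding bin with a zero count so gaps stay visible.
        if (lowerBoundary > 0) {
            int prevLowerBoundary = Math.max(lowerBoundary - CLASS_WIDTH, 0);
            distributionMap.putIfAbsent(prevLowerBoundary + "-" + lowerBoundary, 0L);
        }
        // Add to the bin's running total, or start it at this observation's frequency.
        distributionMap.merge(bin, observationFrequency, Long::sum);
    }

    // Hypothetical helper: rebuilds the map from a list of ages.
    static Map<String, Long> build(List<Integer> ages) {
        distributionMap.clear();
        ages.stream().distinct().forEach(age -> {
            // Stand-in for frequency.getCount(observation) from the tutorial.
            long observationFrequency = Collections.frequency(ages, age);
            int upperBoundary = (age > CLASS_WIDTH)
              ? (int) Math.ceil(age / (double) CLASS_WIDTH) * CLASS_WIDTH
              : CLASS_WIDTH;
            int lowerBoundary = (upperBoundary > CLASS_WIDTH) ? upperBoundary - CLASS_WIDTH : 0;
            updateDistributionMap(lowerBoundary, lowerBoundary + "-" + upperBoundary, observationFrequency);
        });
        return distributionMap;
    }

    public static void main(String[] args) {
        List<Integer> datasetList = Arrays.asList(
          36, 25, 38, 46, 55, 68, 72, 55, 36, 38,
          67, 45, 22, 48, 91, 46, 52, 61, 58, 55);
        build(datasetList).forEach((bin, count) -> System.out.println(bin + ": " + count));
    }
}
```

Map.merge() covers both branches described above in a single call: it inserts the value when the key is absent and otherwise combines the old and new values with Long::sum.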

The Frequency class also has methods for determining the percentage and cumulative percentage of a variable in a dataset.
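
In commons-math3, these are the getPct() and getCumPct() methods of Frequency, both of which return a proportion between 0 and 1. The plain-Java sketch below is not the library's implementation; it only illustrates what the two quantities mean for our dataset, using the hypothetical helpers pct() and cumPct().

```java
import java.util.Arrays;
import java.util.List;

class PercentageSketch {

    // Proportion of observations equal to the given value (between 0 and 1).
    static double pct(List<Integer> data, int value) {
        long matches = data.stream().filter(d -> d == value).count();
        return (double) matches / data.size();
    }

    // Proportion of observations less than or equal to the given value.
    static double cumPct(List<Integer> data, int value) {
        long atOrBelow = data.stream().filter(d -> d <= value).count();
        return (double) atOrBelow / data.size();
    }

    public static void main(String[] args) {
        List<Integer> datasetList = Arrays.asList(
          36, 25, 38, 46, 55, 68, 72, 55, 36, 38,
          67, 45, 22, 48, 91, 46, 52, 61, 58, 55);
        // 3 of the 20 students are exactly 55, and 14 of them are 55 or younger.
        System.out.println("pct(55)    = " + pct(datasetList, 55));
        System.out.println("cumPct(55) = " + cumPct(datasetList, 55));
    }
}
```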

4. Plotting the Histogram Chart

Now that we've processed our raw dataset into a map of age groups and their respective frequencies, we can use the xchart library to display the data in a histogram chart:

CategoryChart chart = new CategoryChartBuilder().width(800).height(600)
  .title("Age Distribution")
  .xAxisTitle("Age Group")
  .yAxisTitle("Frequency")
  .build();

chart.getStyler().setLegendPosition(Styler.LegendPosition.InsideNW);
chart.getStyler().setAvailableSpaceFill(0.99);
chart.getStyler().setOverlapped(true);

// Assumes distributionMap maps bin labels (String) to frequencies (Long)
List<String> xData = new ArrayList<>(distributionMap.keySet());
List<Long> yData = new ArrayList<>(distributionMap.values());
chart.addSeries("age group", xData, yData);

new SwingWrapper<>(chart).displayChart();

We created an instance of a CategoryChart using the chart builder, then configured it and populated it with the data for the x and y axes.

Finally, we display the chart in a GUI using the SwingWrapper:

[Figure: xchart histogram]

From the histogram above, we can see that there are no students aged 80 to 90, while students aged 50 to 60 are the largest group. These are most likely doctoral or post-doctoral students.

We can also say that the histogram is roughly bell-shaped, which suggests the ages are approximately normally distributed.

5. Conclusion

In this article, we've looked at how to harness the power of the Frequency class of the Apache commons-math3 library.

The library also contains other interesting classes for statistics, geometry, genetic algorithms, and more. Its documentation can be found on the Apache Commons Math website.

The complete source code is available over on GitHub.