Calculate Percentiles in Java

Last modified: March 19, 2024

1. Overview

When it comes to analyzing data in Java, calculating percentiles is a fundamental task in understanding the statistical distribution and characteristics of a numeric dataset.

In this tutorial, we’ll walk through the process of calculating percentiles in Java, providing code examples and explanations along the way.

2. Understanding Percentiles

Before discussing the implementation details, let’s first understand what percentiles are and how they’re commonly used in data analysis.

A percentile is a statistical measure indicating the value at or below which a given percentage of observations fall. For instance, the 50th percentile (also known as the median) is the value at or below which 50% of the data points fall.

It’s worth noting that percentiles are expressed in the same unit of measurement as the input dataset, not in percent. For example, if the dataset refers to monthly salary, the corresponding percentiles will be expressed in USD, EUR, or other currencies.

Next, let’s see a couple of concrete examples:

Input: A dataset with numbers 1-100 unsorted
-> sorted dataset: [1, 2, ..., 49, (50), 51, 52, ..., 100]
-> The 50th percentile: 50

Input: [-1, 200, 30, 42, -5, 7, 8, 92]
-> sorted dataset: [-5, -1, 7, (8), 30, 42, 92, 200]
-> The 50th percentile: 8

Percentiles are often used to understand data distribution, identify outliers, and compare different datasets. They’re particularly useful when dealing with large datasets or when succinctly summarizing a dataset’s characteristics.

Next, let’s see how to calculate percentiles in Java.

3. Calculating Percentile From a Collection

Now that we understand what percentiles are, let’s summarize the steps for implementing the percentile calculation:

  • Sort the given dataset in ascending order
  • Calculate the rank of the required percentile as (percentile / 100) * dataset.size
  • Take the ceiling value of the rank, as the rank can be a decimal number
  • The final result is the element at the index ceiling(rank) – 1 in the sorted dataset
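
To make the rank arithmetic concrete, here is a minimal sketch that applies the steps above to the eight-element dataset from Section 2. The class name RankSketch and the helper nearestRank() are hypothetical names used only for illustration, and the helper assumes a percentile greater than 0 and at most 100:

```java
import java.util.Arrays;

public class RankSketch {

    // Hypothetical helper: 1-based rank for a percentile in (0, 100]
    static int nearestRank(double percentile, int size) {
        return (int) Math.ceil(percentile / 100.0 * size);
    }

    public static void main(String[] args) {
        int[] sorted = { -1, 200, 30, 42, -5, 7, 8, 92 };
        Arrays.sort(sorted); // [-5, -1, 7, 8, 30, 42, 92, 200]

        // 50th percentile: rank = ceil(0.5 * 8) = 4, so we take index 3
        int rank50 = nearestRank(50, sorted.length);
        System.out.println(rank50 + " -> " + sorted[rank50 - 1]); // 4 -> 8

        // 75.3rd percentile: rank = ceil(6.024) = 7, so we take index 6
        int rank75 = nearestRank(75.3, sorted.length);
        System.out.println(rank75 + " -> " + sorted[rank75 - 1]); // 7 -> 92
    }
}
```

Since the rank is rounded up, the result is always an element of the dataset itself rather than an interpolated value.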

Next, let’s create a generic method to implement the above logic:

import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;

static <T extends Comparable<T>> T getPercentile(Collection<T> input, double percentile) {
    if (input == null || input.isEmpty()) {
        throw new IllegalArgumentException("The input dataset cannot be null or empty.");
    }
    if (percentile < 0 || percentile > 100) {
        throw new IllegalArgumentException("Percentile must be between 0 and 100 inclusive.");
    }
    // Sort into a new list so the original collection stays unchanged
    List<T> sortedList = input.stream()
      .sorted()
      .collect(Collectors.toList());

    // 1-based rank; percentile 0 would give rank 0, so we map it to the first element
    int rank = percentile == 0 ? 1 : (int) Math.ceil(percentile / 100.0 * input.size());
    return sortedList.get(rank - 1);
}

As we can see, the implementation above is pretty straightforward. However, it’s worth mentioning a couple of things:

  • We must validate the percentile parameter (0 <= percentile <= 100)
  • When the percentile is 0, the computed rank would be 0, so we use rank 1 and return the smallest element
  • We sorted the input dataset using the Stream API and collected the sorted result in a new list to avoid modifying the original dataset
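
Both points can be checked with a small self-contained sketch. The class name PercentileChecks, the helper isRejected(), and the sample values are hypothetical and exist only for this illustration; getPercentile() repeats the logic shown above:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;

public class PercentileChecks {

    // Same logic as the getPercentile() method shown above
    static <T extends Comparable<T>> T getPercentile(Collection<T> input, double percentile) {
        if (input == null || input.isEmpty()) {
            throw new IllegalArgumentException("The input dataset cannot be null or empty.");
        }
        if (percentile < 0 || percentile > 100) {
            throw new IllegalArgumentException("Percentile must be between 0 and 100 inclusive.");
        }
        List<T> sortedList = input.stream().sorted().collect(Collectors.toList());
        int rank = percentile == 0 ? 1 : (int) Math.ceil(percentile / 100.0 * input.size());
        return sortedList.get(rank - 1);
    }

    // Hypothetical helper: true if the call is rejected with an IllegalArgumentException
    static boolean isRejected(Collection<Integer> input, double percentile) {
        try {
            getPercentile(input, percentile);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<Integer> original = new ArrayList<>(List.of(30, -5, 8, 200));
        List<Integer> snapshot = new ArrayList<>(original);

        // Sorted copy is [-5, 8, 30, 200]; rank = ceil(0.5 * 4) = 2, so the result is 8
        System.out.println(getPercentile(original, 50));
        // The input list keeps its original order
        System.out.println(original.equals(snapshot));
        // Out-of-range percentile and empty dataset are both rejected
        System.out.println(isRejected(original, 100.5));
        System.out.println(isRejected(new ArrayList<>(), 50));
    }
}
```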

Next, let’s test our getPercentile() method.

4. Testing the getPercentile() Method

First, the method should throw an IllegalArgumentException if the percentile is out of the valid range:

assertThrows(IllegalArgumentException.class, () -> getPercentile(List.of(1, 2, 3), -1));
assertThrows(IllegalArgumentException.class, () -> getPercentile(List.of(1, 2, 3), 101));

We used the assertThrows() method to verify that the expected exception was raised.

Next, let’s take a List of 1-100 as the input to verify whether the method can produce the expected result:

List<Integer> list100 = IntStream.rangeClosed(1, 100)
  .boxed()
  .collect(Collectors.toList());
Collections.shuffle(list100);
 
assertEquals(1, getPercentile(list100, 0));
assertEquals(10, getPercentile(list100, 10));
assertEquals(25, getPercentile(list100, 25));
assertEquals(50, getPercentile(list100, 50));
assertEquals(76, getPercentile(list100, 75.3));
assertEquals(100, getPercentile(list100, 100));

In the above code, we prepared the input list through an IntStream. Further, we used the shuffle() method to put the 100 numbers in random order, so the test doesn’t depend on the input being pre-sorted.

Additionally, let’s test our method with another dataset input:

List<Integer> list8 = IntStream.of(-1, 200, 30, 42, -5, 7, 8, 92)
  .boxed()
  .collect(Collectors.toList());

assertEquals(-5, getPercentile(list8, 0));
assertEquals(-5, getPercentile(list8, 10));
assertEquals(-1, getPercentile(list8, 25));
assertEquals(8, getPercentile(list8, 50));
assertEquals(92, getPercentile(list8, 75.3));
assertEquals(200, getPercentile(list8, 100));
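
Because getPercentile() accepts any T that extends Comparable<T>, it isn’t limited to Integer. The following sketch, which assumes made-up sample values and a hypothetical class name GenericPercentileSketch, calls the same logic with Double and String elements:

```java
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;

public class GenericPercentileSketch {

    // Same logic as the getPercentile() method from Section 3
    static <T extends Comparable<T>> T getPercentile(Collection<T> input, double percentile) {
        if (input == null || input.isEmpty()) {
            throw new IllegalArgumentException("The input dataset cannot be null or empty.");
        }
        if (percentile < 0 || percentile > 100) {
            throw new IllegalArgumentException("Percentile must be between 0 and 100 inclusive.");
        }
        List<T> sortedList = input.stream().sorted().collect(Collectors.toList());
        int rank = percentile == 0 ? 1 : (int) Math.ceil(percentile / 100.0 * input.size());
        return sortedList.get(rank - 1);
    }

    public static void main(String[] args) {
        // Made-up monthly salaries; sorted: 2950.75, 3100.0, 4200.5, 5800.25
        List<Double> salaries = List.of(4200.5, 3100.0, 5800.25, 2950.75);
        // rank = ceil(0.75 * 4) = 3, so the third sorted element is returned
        System.out.println(getPercentile(salaries, 75)); // 4200.5

        // Strings are ordered lexicographically: alpha, bravo, charlie, delta
        List<String> names = List.of("delta", "alpha", "charlie", "bravo");
        // rank = ceil(0.5 * 4) = 2, so the second sorted element is returned
        System.out.println(getPercentile(names, 50)); // bravo
    }
}
```

As with the salary example in Section 2, the result has the same type and unit as the input elements.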

5. Calculating Percentile From an Array

Sometimes, the given dataset input is an array instead of a Collection. In this case, we can first convert the input array to a List and then utilize our getPercentile() method to calculate the required percentiles.

Next, let’s demonstrate how to achieve this by taking a long array as the input:

long[] theArray = new long[] { -1, 200, 30, 42, -5, 7, 8, 92 };
 
// Convert the long[] array to a List<Long> (Stream.toList() requires Java 16 or later)
List<Long> list8 = Arrays.stream(theArray)
  .boxed()
  .toList();
 
assertEquals(-5, getPercentile(list8, 0));
assertEquals(-5, getPercentile(list8, 10));
assertEquals(-1, getPercentile(list8, 25));
assertEquals(8, getPercentile(list8, 50));
assertEquals(92, getPercentile(list8, 75.3));
assertEquals(200, getPercentile(list8, 100));

As the code shows, since our input is an array of primitives (long[]), we employed Arrays.stream() and boxed() to convert it to a List<Long>. Then, we can pass the converted List to getPercentile() to get the expected result.
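
If boxing every element is a concern, one alternative is to apply the same rank logic directly to the primitive array. The sketch below is only an illustration of that idea: the class name ArrayPercentileSketch and the long[] overload are hypothetical and are not part of the method defined in Section 3:

```java
import java.util.Arrays;

public class ArrayPercentileSketch {

    // Hypothetical overload for long[]: same rank logic, without boxing
    static long getPercentile(long[] input, double percentile) {
        if (input == null || input.length == 0) {
            throw new IllegalArgumentException("The input dataset cannot be null or empty.");
        }
        if (percentile < 0 || percentile > 100) {
            throw new IllegalArgumentException("Percentile must be between 0 and 100 inclusive.");
        }
        // Sort a copy so that the caller's array is left unchanged
        long[] sorted = Arrays.copyOf(input, input.length);
        Arrays.sort(sorted);

        int rank = percentile == 0 ? 1 : (int) Math.ceil(percentile / 100.0 * sorted.length);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        long[] theArray = { -1, 200, 30, 42, -5, 7, 8, 92 };
        System.out.println(getPercentile(theArray, 50));   // 8
        System.out.println(getPercentile(theArray, 75.3)); // 92
    }
}
```

The trade-off is that each primitive type would need its own overload, whereas the conversion approach reuses the single generic method.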

6. Conclusion

In this article, we first discussed the underlying principles of percentiles. Then, we explored how to compute percentiles for a dataset in Java.

As always, the complete source code for the examples is available over on GitHub.
