Calculate Weighted Mean in Java

Last modified: February 6, 2024

1. Introduction

In this article, we’re going to explore a few different ways to solve the same problem – calculating the weighted mean of a set of values.

2. What Is a Weighted Mean?

We calculate the standard mean of a set of numbers by summing all of the numbers and then dividing this by the count of the numbers. For example, the mean of the numbers 1, 3, 5, 7, 9 will be (1 + 3 + 5 + 7 + 9) / 5, which equals 5.

When we’re calculating a weighted mean, we instead have a set of numbers that each have weights:

Number  Weight
1       10
3       20
5       30
7       50
9       40

In this case, we need to take the weights into account. The new calculation is to sum the product of each number with its weight and divide this by the sum of all the weights. For example, here the mean will be ((1 * 10) + (3 * 20) + (5 * 30) + (7 * 50) + (9 * 40)) / (10 + 20 + 30 + 50 + 40), which equals 6.2.
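
As a quick check of the arithmetic above, the same calculation can be written with two plain arrays. This is only an illustrative sketch: the class name WeightedMeanCheck and the method weightedMean are hypothetical names that aren't used elsewhere in this article.

```java
public class WeightedMeanCheck {
    // Hypothetical helper: sum(number * weight) / sum(weight).
    static double weightedMean(int[] numbers, int[] weights) {
        double sumOfProducts = 0;
        double sumOfWeights = 0;
        for (int i = 0; i < numbers.length; i++) {
            sumOfProducts += (double) numbers[i] * weights[i];
            sumOfWeights += weights[i];
        }
        return sumOfProducts / sumOfWeights;
    }

    public static void main(String[] args) {
        int[] numbers = {1, 3, 5, 7, 9};
        int[] weights = {10, 20, 30, 50, 40};
        // The sums are 930 and 150, as in the worked example.
        System.out.println(weightedMean(numbers, weights));
    }
}
```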

3. Setting Up

For the sake of these examples, we’ll do some initial setup. The most important thing is that we need a type to represent our weighted values:

private static class Values {
    int value;
    int weight;

    public Values(int value, int weight) {
        this.value = value;
        this.weight = weight;
    }
}

In our sample code, we’ll also have an initial set of values and an expected result from our average:

private List<Values> values = Arrays.asList(
    new Values(1, 10),
    new Values(3, 20),
    new Values(5, 30),
    new Values(7, 50),
    new Values(9, 40)
);

private Double expected = 6.2;

4. Two-Pass Calculation

The most obvious way to calculate this is exactly as we saw above. We can iterate over the list of numbers and separately sum the values that we need for our division:

// Numerator: the sum of value * weight
double top = values.stream()
  .mapToDouble(v -> v.value * v.weight)
  .sum();
// Denominator: the sum of the weights
double bottom = values.stream()
  .mapToDouble(v -> v.weight)
  .sum();

Having done this, our calculation is now just a case of dividing one by the other:

double result = top / bottom;

We can simplify this further by using a traditional for loop and computing both sums in a single pass; the final division stays the same. The downside is that top and bottom can no longer be immutable (final) values:

double top = 0;
double bottom = 0;

for (Values v : values) {
    top += (v.value * v.weight);
    bottom += v.weight;
}
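
The loop above multiplies two int fields before adding the product to a double, so a very large value and weight could overflow int, and an empty list (or weights that sum to zero) would lead to a division by zero. The sketch below wraps the single-pass loop in a method that guards against both. The class SinglePassMean, the method weightedMean, and the choice to throw IllegalArgumentException are assumptions made for illustration; Values is redeclared only to keep the example self-contained.

```java
import java.util.Arrays;
import java.util.List;

public class SinglePassMean {
    // Redeclared here so the sketch compiles on its own.
    static class Values {
        final int value;
        final int weight;

        Values(int value, int weight) {
            this.value = value;
            this.weight = weight;
        }
    }

    static double weightedMean(List<Values> values) {
        double top = 0;
        double bottom = 0;
        for (Values v : values) {
            // Widen to double before multiplying so the product can't overflow int.
            top += (double) v.value * v.weight;
            bottom += v.weight;
        }
        if (bottom == 0) {
            // Assumed policy: reject input whose weights sum to zero.
            throw new IllegalArgumentException("Total weight must not be zero");
        }
        return top / bottom;
    }

    public static void main(String[] args) {
        List<Values> values = Arrays.asList(
            new Values(1, 10), new Values(3, 20), new Values(5, 30),
            new Values(7, 50), new Values(9, 40));
        System.out.println(weightedMean(values));
    }
}
```

Whether to throw, return NaN, or return an Optional for the zero-weight case is a design choice that depends on the caller.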

5. Expanding the List

We can think about our weighted average calculation in a different way. Instead of calculating a sum of products, we can expand each of the weighted values. For example, we can expand our list to contain 10 copies of “1”, 20 copies of “3”, and so on. At this point, we can do a straight average on the expanded list:

double result = values.stream()
  .flatMap(v -> Collections.nCopies(v.weight, v.value).stream())
  .mapToInt(v -> v)
  .average()
  .getAsDouble();

This is obviously going to be less efficient, but it may also be clearer and easier to understand. We can also more easily do other manipulations on the final set of numbers — for example, finding the median is much easier to understand this way.
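
To illustrate the point about the median, here is one possible sketch that expands the list in the same way and then picks the middle of the sorted values. The names ExpandedMedian and weightedMedian are hypothetical, and Values is redeclared so the example is self-contained. Like the expansion above, it assumes non-negative integer weights.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ExpandedMedian {
    // Redeclared here so the sketch compiles on its own.
    static class Values {
        final int value;
        final int weight;

        Values(int value, int weight) {
            this.value = value;
            this.weight = weight;
        }
    }

    static double weightedMedian(List<Values> values) {
        // Expand each value into 'weight' copies, then sort the result.
        int[] expanded = values.stream()
          .flatMap(v -> Collections.nCopies(v.weight, v.value).stream())
          .mapToInt(Integer::intValue)
          .sorted()
          .toArray();
        if (expanded.length == 0) {
            // Assumed policy: an empty expansion has no median.
            throw new IllegalArgumentException("No values to take a median of");
        }
        int mid = expanded.length / 2;
        // Odd count: the middle element. Even count: mean of the two middle elements.
        return expanded.length % 2 == 1
          ? expanded[mid]
          : (expanded[mid - 1] + expanded[mid]) / 2.0;
    }

    public static void main(String[] args) {
        List<Values> values = Arrays.asList(
            new Values(1, 10), new Values(3, 20), new Values(5, 30),
            new Values(7, 50), new Values(9, 40));
        System.out.println(weightedMedian(values));
    }
}
```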

6. Reducing the List

We’ve seen that summing the products and weights is more efficient than expanding out the values. But what if we want to do this in a single pass without using mutable values? We can achieve this using the reduce() functionality from Streams. In particular, we’ll use it to perform our addition as we go, collecting the running totals into a new immutable object at each step.

The first thing we want is a class to collect our running totals into:

class WeightedAverage {
    final double top;
    final double bottom;

    public WeightedAverage(double top, double bottom) {
        this.top = top;
        this.bottom = bottom;
    }

    double average() {
        return top / bottom;
    }
}

The class also has an average() method that performs our final calculation. Now, we can perform our reduction:

double result = values.stream()
  .reduce(new WeightedAverage(0, 0),
    (acc, next) -> new WeightedAverage(
      acc.top + (next.value * next.weight),
      acc.bottom + next.weight),
    (left, right) -> new WeightedAverage(
      left.top + right.top,
      left.bottom + right.bottom))
  .average();

This looks very complicated, so let’s break it down into parts.

The first parameter to reduce() is our identity. This is a WeightedAverage whose totals are both 0, which serves as the starting point for the reduction.

The next parameter is a lambda that takes a WeightedAverage instance and adds the next value to it. We’ll notice that our sum here is calculated in the same way as what we performed earlier.

The final parameter is a lambda for combining two WeightedAverage instances. This is necessary for certain cases with reduce(), such as if we were doing this on a parallel stream.

The result of the reduce() call is then a WeightedAverage instance that we can use to calculate our result.
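
As a hedged illustration of why the third argument matters, the sketch below runs the same reduction on a parallel stream, where partial WeightedAverage instances may be merged by the combiner. The wrapper class ParallelReduceSketch and the method weightedMean are hypothetical; Values and WeightedAverage are redeclared as nested classes to keep the example self-contained.

```java
import java.util.Arrays;
import java.util.List;

public class ParallelReduceSketch {
    static class Values {
        final int value;
        final int weight;

        Values(int value, int weight) {
            this.value = value;
            this.weight = weight;
        }
    }

    // Immutable running totals, as in the reduce() example.
    static class WeightedAverage {
        final double top;
        final double bottom;

        WeightedAverage(double top, double bottom) {
            this.top = top;
            this.bottom = bottom;
        }

        double average() {
            return top / bottom;
        }
    }

    static double weightedMean(List<Values> values) {
        return values.parallelStream()
          .reduce(new WeightedAverage(0, 0),
            // Accumulator: fold the next value into a new totals object.
            (acc, next) -> new WeightedAverage(
              acc.top + (next.value * next.weight),
              acc.bottom + next.weight),
            // Combiner: merge two partial totals produced by different threads.
            (left, right) -> new WeightedAverage(
              left.top + right.top,
              left.bottom + right.bottom))
          .average();
    }

    public static void main(String[] args) {
        List<Values> values = Arrays.asList(
            new Values(1, 10), new Values(3, 20), new Values(5, 30),
            new Values(7, 50), new Values(9, 40));
        System.out.println(weightedMean(values));
    }
}
```

Because every step creates a new WeightedAverage rather than mutating a shared one, the reduction stays safe however the framework chooses to split the work.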

7. Custom Collectors

Our reduce() version is certainly clean, but it’s harder to understand than our other attempts. We’ve ended up passing two lambdas into the function, and we still need a post-processing step to calculate the average.

One final solution that we can explore is writing a custom collector to encapsulate this work. This will directly produce our result, and it’ll be much simpler to use.

Before we write our collector, let’s look at the interface we need to implement:

public interface Collector<T, A, R> {
    Supplier<A> supplier();
    BiConsumer<A, T> accumulator();
    BinaryOperator<A> combiner();
    Function<A, R> finisher();
    Set<Characteristics> characteristics();
}

There’s a lot going on here, but we’ll work through it as we build our collector. We’ll also see how some of this extra complexity allows us to use the exact same collector on a parallel stream instead of only on a sequential stream.

The first thing to note is the generic types:

  • T – This is the input type. Our collector always needs to be tied to the type of values that it can collect.
  • R – This is the result type. Our collector always needs to specify the type it will produce.
  • A – This is the aggregation type. This is typically internal to the collector but is necessary for some of the function signatures.

This means that we need to define an aggregation type: a type that holds the running result as we go. We can’t keep this state directly in our collector because we need to support parallel streams, where the framework may be accumulating several partial results at once. As such, we define a separate type that stores the results from each parallel stream:

class RunningTotals {
    double top;
    double bottom;

    public RunningTotals() {
        this.top = 0;
        this.bottom = 0;
    }
}

This is a mutable type, but because its use will be constrained to one parallel stream, that’s okay.

Now, we can implement our collector methods. We’ll notice that most of these return lambdas. Again, this is to support parallel streams where the underlying streams framework will call some combination of them as appropriate.

The first method is supplier(). This constructs a new, zero instance of our RunningTotals:

@Override
public Supplier<RunningTotals> supplier() {
    return RunningTotals::new;
}

Next, we have accumulator(). This takes a RunningTotals instance and the next Values instance to process and combines them, updating our RunningTotals instance in place:

@Override
public BiConsumer<RunningTotals, Values> accumulator() {
    return (current, next) -> {
        current.top += (next.value * next.weight);
        current.bottom += next.weight;
    };
}

Our next method is combiner(). This takes two RunningTotals instances – from different parallel streams – and combines them into one:

@Override
public BinaryOperator<RunningTotals> combiner() {
    return (left, right) -> {
        left.top += right.top;
        left.bottom += right.bottom;

        return left;
    };
}

In this case, we’re mutating one of our inputs and directly returning that. This is perfectly safe, but we can also return a new instance if that’s easier.

This will only be used if the stream is processed in parallel and the streams framework splits the work into multiple parts, which depends on several factors. However, we should implement it in case this does ever happen.

The final lambda method that we need to implement is finisher(). This takes the final RunningTotals instance that is left after all of the values have been accumulated and all of the parallel streams have been combined, and returns the final result:

@Override
public Function<RunningTotals, Double> finisher() {
    return rt -> rt.top / rt.bottom;
}
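
One edge case worth noting: if the stream is empty (or all of the weights are zero), rt.top / rt.bottom is the floating-point division 0.0 / 0.0, which Java evaluates to NaN rather than throwing. The sketch below shows that behaviour next to one possible guarded alternative. FinisherSketch, plainFinisher, and guardedFinisher are hypothetical names, and throwing IllegalStateException is an assumed policy.

```java
import java.util.function.Function;

public class FinisherSketch {
    // Same shape as the aggregation type used by the collector.
    static class RunningTotals {
        double top;
        double bottom;
    }

    // The finisher as written above: 0.0 / 0.0 yields NaN when nothing was accumulated.
    static Function<RunningTotals, Double> plainFinisher() {
        return rt -> rt.top / rt.bottom;
    }

    // Assumed alternative: fail loudly when no weight was accumulated.
    static Function<RunningTotals, Double> guardedFinisher() {
        return rt -> {
            if (rt.bottom == 0) {
                throw new IllegalStateException("No weight accumulated");
            }
            return rt.top / rt.bottom;
        };
    }

    public static void main(String[] args) {
        RunningTotals empty = new RunningTotals();
        System.out.println(Double.isNaN(plainFinisher().apply(empty)));
    }
}
```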

Our Collector also needs a characteristics() method that returns a set of characteristics describing how the collector can be used. The Collector.Characteristics enum consists of three values:

  • CONCURRENT – The accumulator() function is safe to call on the same aggregation instance from parallel threads. If this is specified, the framework may accumulate into a single shared instance instead of using combiner(), so the accumulator() function must be thread-safe.
  • UNORDERED – The collector can safely process the elements from the underlying stream in any order. If this isn’t specified, then, where possible, the values will be provided in the correct order.
  • IDENTITY_FINISH – The finisher() function just directly returns its input. If this is specified, then the collection process may short-circuit this call and just return the value directly.

In our case, we have an UNORDERED collector but need to omit the other two:

@Override
public Set<Characteristics> characteristics() {
    return Collections.singleton(Characteristics.UNORDERED);
}

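We’re now ready to use our collector. Here, WeightedAverage is the name of the class implementing Collector<Values, RunningTotals, Double>, not the running-totals class from the reduce() example: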

double result = values.stream().collect(new WeightedAverage());
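
Putting the pieces from this section together, the complete collector might look like the following sketch. The outer class WeightedAverageCollectorSketch and the helper weightedMean are hypothetical wrappers added so the example compiles on its own. The helper uses parallelStream() to give the combiner a chance to run, though whether the work is actually split is up to the framework.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collector;

public class WeightedAverageCollectorSketch {
    static class Values {
        final int value;
        final int weight;

        Values(int value, int weight) {
            this.value = value;
            this.weight = weight;
        }
    }

    // Mutable aggregation type; each instance is confined to one part of the stream.
    static class RunningTotals {
        double top;
        double bottom;
    }

    static class WeightedAverage implements Collector<Values, RunningTotals, Double> {
        @Override
        public Supplier<RunningTotals> supplier() {
            return RunningTotals::new;
        }

        @Override
        public BiConsumer<RunningTotals, Values> accumulator() {
            return (current, next) -> {
                current.top += (next.value * next.weight);
                current.bottom += next.weight;
            };
        }

        @Override
        public BinaryOperator<RunningTotals> combiner() {
            return (left, right) -> {
                left.top += right.top;
                left.bottom += right.bottom;
                return left;
            };
        }

        @Override
        public Function<RunningTotals, Double> finisher() {
            return rt -> rt.top / rt.bottom;
        }

        @Override
        public Set<Characteristics> characteristics() {
            return Collections.singleton(Characteristics.UNORDERED);
        }
    }

    // Hypothetical helper that applies the collector to a parallel stream.
    static double weightedMean(List<Values> values) {
        return values.parallelStream().collect(new WeightedAverage());
    }

    public static void main(String[] args) {
        List<Values> values = Arrays.asList(
            new Values(1, 10), new Values(3, 20), new Values(5, 30),
            new Values(7, 50), new Values(9, 40));
        System.out.println(weightedMean(values));
    }
}
```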

While writing the collector is much more complicated than before, using it is significantly easier. We can also leverage things like parallel streams with no extra work, meaning that this gives us an easier-to-use and more powerful solution, assuming that we need to reuse it.
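
If a full class feels heavy for occasional use, a comparable collector can be sketched with the JDK’s Collector.of() factory, which takes the same supplier, accumulator, combiner, and finisher as lambdas. In this sketch, the aggregation type is a two-element double[] (an assumed stand-in for a dedicated totals class), and the names CollectorOfSketch and weightedAverage are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collector;

public class CollectorOfSketch {
    static class Values {
        final int value;
        final int weight;

        Values(int value, int weight) {
            this.value = value;
            this.weight = weight;
        }
    }

    // T = Values, A = double[] {sum of products, sum of weights}, R = Double
    static Collector<Values, double[], Double> weightedAverage() {
        return Collector.of(
            () -> new double[2],
            (totals, next) -> {
                totals[0] += (double) next.value * next.weight;
                totals[1] += next.weight;
            },
            (left, right) -> {
                left[0] += right[0];
                left[1] += right[1];
                return left;
            },
            totals -> totals[0] / totals[1],
            Collector.Characteristics.UNORDERED);
    }

    public static void main(String[] args) {
        List<Values> values = Arrays.asList(
            new Values(1, 10), new Values(3, 20), new Values(5, 30),
            new Values(7, 50), new Values(9, 40));
        System.out.println(values.stream().collect(weightedAverage()));
    }
}
```

The trade-off is readability: a named aggregation class documents what each field means, whereas the array version relies on index positions.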

8. Conclusion

Here, we’ve seen several different ways that we can calculate a weighted average of a set of values, ranging from simply looping over the values ourselves to writing a full Collector instance that can be reused whenever we need to perform this calculation. Next time you need to do this, why not give one of these a go?

As always, the full code for this article is available over on GitHub.
