A Collaborative Filtering Recommendation System in Java – Java中的协作过滤推荐系统

最后修改: 2016年 12月 27日

中文/混合/英文(键盘快捷键:t)

1. Introduction

1.介绍

In this tutorial, we’ll learn all about the Slope One algorithm in Java.

在本教程中,我们将学习Java中的Slope One算法的所有知识。

We’ll also show the example implementation for the problem of Collaborative Filtering (CF) – a machine learning technique used by recommendation systems.

我们还将展示协作过滤(CF)问题的实例实现–这是一种机器学习技术被推荐系统使用

This can be used, for example, to predict user interests for specific items.

例如,这可用于预测用户对特定项目的兴趣。

2. Collaborative Filtering

2.协作过滤

The Slope One algorithm is an item-based collaborative filtering system. It means that it is completely based on the user-item ranking. When we compute the similarity between objects, we only know the history of rankings, not the content itself. This similarity is then used to predict potential user rankings for user-item pairs not present in the dataset.

Slope One算法是一个基于项目的协同过滤系统。这意味着它完全基于用户-物品的排名。当我们计算对象之间的相似度时,我们只知道排名的历史,而不是内容本身。这种相似性然后被用来预测数据集中不存在的用户-项目对的潜在用户排名。

The image below show the complete process of obtaining and calculating the rating for the specific user:

下面的图片显示了为特定用户获取和计算评级的完整过程。

Collaborative filtering new

At first, users rate different items in the system. Next, the algorithm calculates the similarities. After that, the system makes predictions for user-item ratings, which the user hasn’t rated yet.

首先,用户对系统中的不同项目进行评分。接下来,算法会计算出相似性。之后,系统对用户-项目评级进行预测,而用户还没有对这些项目进行评级。

For more details on the topic of the collaborative filtering, we can refer to the Wikipedia article.

关于协同过滤主题的更多细节,我们可以参考维基百科文章

3. The Slope One Algorithm

3.斜率一算法

Slope One was named as the simplest form of non-trivial item-based collaborative filtering based on ratings. It takes into account both information from all users who rated the same item and from the other items rated by the same user to calculate the similarity matrix.

斜率一被命名为最简单的基于评分的非琐碎的项目协作过滤形式。它既考虑了所有对同一项目进行评分的用户的信息,也考虑了同一用户评分的其他项目的信息,以计算出相似度矩阵。

In our simple example, we are going to predict user rankings on the items in the store.

在我们的简单例子中,我们要预测用户对商店里的商品的排名。

Let’s start with a simple Java model for our problem and domain.

让我们从我们的问题和领域的一个简单的Java模型开始。

3.1. Java Model

3.1.Java模型

In our model we have two main objects – items and users. The Item class contains the name of the item:

在我们的模型中,我们有两个主要对象–项目和用户。Item类包含项目的名称。

private String itemName;

On the other hand, the User class contains the username:

另一方面,User类包含用户名。

private String username;

Finally, we have a InputData class that will be used to initialize the data. Let us assume that we’ll create five different products in the store:

最后,我们有一个InputData类,将用于初始化数据。让我们假设,我们将在商店中创建五个不同的产品。

List<Item> items = Arrays.asList(
  new Item("Candy"), 
  new Item("Drink"), 
  new Item("Soda"), 
  new Item("Popcorn"), 
  new Item("Snacks")
);

Moreover, we’ll create three users that randomly rated some of the above-mentioned using the scale from 0.0-1.0, where 0 means no interest, 0.5 somehow interested and 1.0 means totally interested. As a result of data initialization we’ll get a Map with user item-ranking data:

此外,我们将创建三个用户,用0.0-1.0的量表对上述的一些项目进行随机评分,其中0表示没有兴趣,0.5表示有点兴趣,1.0表示完全感兴趣。作为数据初始化的结果,我们将得到一个包含用户项目排名数据的Map

Map<User, HashMap<Item, Double>> data;

3.2. Differences and Frequencies Matrices

3.2.差异和频率矩阵

Based on the available data, we’ll calculate the relationships between the items, as well as the number of items’ occurrences. For each user, we check his/her rating of the items:

根据现有的数据,我们将计算出项目之间的关系,以及项目出现的数量。对于每个用户,我们检查他/她对项目的评分。

for (HashMap<Item, Double> user : data.values()) {
    for (Entry<Item, Double> e : user.entrySet()) {
        // ...
    }
}

In the next step, we check if the item is existing in our matrices. If this is a first occurrence, we create the new entry in the maps:

在下一步,我们检查该项目是否存在于我们的矩阵中。如果这是第一次出现,我们就在地图中创建新条目。

if (!diff.containsKey(e.getKey())) {
    diff.put(e.getKey(), new HashMap<Item, Double>());
    freq.put(e.getKey(), new HashMap<Item, Integer>());
}

The first matrix is used to calculate the differences between the user ratings. The values of it might be positive or negative (since the difference between ratings might be negative), and are stored as Double. On the other hand, the frequencies are stored as Integer values.

第一个矩阵用于计算用户评级之间的差异。它的值可能是正的,也可能是负的(因为评级之间的差异可能是负的),并被存储为Double。另一方面,频率被存储为Integer值。

In the next step we are going to compare the ratings of all items:

在下一步,我们将比较所有项目的评级。

for (Entry<Item, Double> e2 : user.entrySet()) {
    int oldCount = 0;
    if (freq.get(e.getKey()).containsKey(e2.getKey())){
        oldCount = freq.get(e.getKey()).get(e2.getKey()).intValue();
    }

    double oldDiff = 0.0;
    if (diff.get(e.getKey()).containsKey(e2.getKey())){
        oldDiff = diff.get(e.getKey()).get(e2.getKey()).doubleValue();
    }
    
    double observedDiff = e.getValue() - e2.getValue();
    freq.get(e.getKey()).put(e2.getKey(), oldCount + 1);
    diff.get(e.getKey()).put(e2.getKey(), oldDiff + observedDiff);
}

If somebody rated the item before, we increase the frequency count by one. Moreover, we check the average difference between the item’s ratings and calculate the new observedDiff.

如果之前有人给这个项目评分,我们就把频率数增加一个。此外,我们检查该项目评级之间的平均差异,并计算新的observedDiff

Please note, that we put the sum of oldDiff and observedDiff as a new value of an item.

请注意,我们把oldDiffobservedDiff的总和作为一个项目的新值。

Finally, we calculate the similarity scores inside the matrices:

最后,我们计算矩阵内部的相似度分数。

for (Item j : diff.keySet()) {
    for (Item i : diff.get(j).keySet()) {
        double oldValue = diff.get(j).get(i).doubleValue();
        int count = freq.get(j).get(i).intValue();
        diff.get(j).put(i, oldValue / count);
    }
}

The main logic is to divide the calculated item rating’s difference by the number of its occurrences. After that step, we can print out our final differences matrix.

主要的逻辑是将计算出的项目评级的差异除以其出现的次数。在这一步之后,我们可以打印出我们最终的差异矩阵。

3.3. Predictions

3.3.预测

As the main part of the Slope One, we are going to predict all missing ratings based on the existing data. In order to do that, we need to compare the user-item ratings with differences matrix calculated in the previous step:

作为Slope One的主要部分,我们要根据现有的数据来预测所有缺失的评分。为了做到这一点,我们需要将用户-项目评分与上一步计算的差异矩阵进行比较。

for (Entry<User, HashMap<Item, Double>> e : data.entrySet()) {
    for (Item j : e.getValue().keySet()) {
        for (Item k : diff.keySet()) {
            double predictedValue =
              diff.get(k).get(j).doubleValue() + e.getValue().get(j).doubleValue();
            double finalValue = predictedValue * freq.get(k).get(j).intValue();
            uPred.put(k, uPred.get(k) + finalValue);
            uFreq.put(k, uFreq.get(k) + freq.get(k).get(j).intValue());
        }
    }
    // ...
}

After that, we need to prepare the “clean” predictions using the code below:

之后,我们需要使用下面的代码准备 “干净 “的预测。

HashMap<Item, Double> clean = new HashMap<Item, Double>();
for (Item j : uPred.keySet()) {
    if (uFreq.get(j) > 0) {
        clean.put(j, uPred.get(j).doubleValue() / uFreq.get(j).intValue());
    }
}
for (Item j : InputData.items) {
    if (e.getValue().containsKey(j)) {
        clean.put(j, e.getValue().get(j));
    } else if (!clean.containsKey(j)) {
        clean.put(j, -1.0);
    }
}

The trick to consider with larger data set is to use only the item entries that have a large frequency value (for example > 1). Please note, that if the prediction is not possible, the value of it will be equal to -1.

对于较大的数据集,需要考虑的诀窍是只使用具有较大频率值(例如>1)的项目条目。请注意,如果预测不可能,它的值将等于-1。

Finally, very important note. If our algorithm worked correctly, we should receive the predictions for items that user didn’t rate, but also the repeated ratings for the items that he rated. Those repeated rating’s should not change, otherwise it means that there is a bug in your algorithm implementation.

最后,非常重要的一点。如果我们的算法工作正常,我们应该收到用户没有评分的项目的预测,但也应该收到他评分的项目的重复评分。这些重复的评价不应该改变,否则就意味着你的算法实现中存在一个错误。

3.4. Tips

3.4.提示

There are few major factors that affect Slope One algorithm. Here are the few tips how to increase the accuracy and processing time:

影响Slope One算法的主要因素有几个。以下是如何提高准确性和处理时间的几个技巧。

  • consider obtaining the user-item ratings on the DB side for large data sets
  • set the time frame for ratings fetching, as people interests might change over the time – it will also reduce the time required to process input data
  • split large data sets into smaller ones – you don’t need to calculate predictions for all users every day; you can check if the user interacted with the predicted item and then add/remove him/her from processing queue for the next day

4. Conclusion

4.结论

In this tutorial we were able to learn about the Slope One algorithm. Moreover, we introduced the collaborative filtering problem for item recommendation systems.

在本教程中,我们能够了解到Slope One算法。此外,我们还介绍了物品推荐系统的协同过滤问题。

The full implementation of this tutorial can be found in the GitHub project.

本教程的完整实现可以在GitHub项目中找到。