Counting Words in a String with Java

Last modified: August 24, 2019

1. Overview

In this tutorial, we’ll go over different ways of counting the words in a given string using Java.

2. Using StringTokenizer

A simple way to count words in a string in Java is to use the StringTokenizer class:

assertEquals(3, new StringTokenizer("three blind mice").countTokens());
assertEquals(4, new StringTokenizer("see\thow\tthey\trun").countTokens());

Note that StringTokenizer automatically takes care of whitespace for us, like tabs and carriage returns.
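
As a minimal sketch of that default behavior (the class and method names here are illustrative and not part of the original tutorial), the single-argument constructor uses space, tab, newline, carriage return, and form feed as delimiters, and a run of them counts as one separator:

```java
import java.util.StringTokenizer;

final class DefaultDelimiterDemo {

    // Counts tokens using StringTokenizer's default delimiter set:
    // space, tab, newline, carriage return, and form feed.
    static int countWords(String text) {
        return new StringTokenizer(text).countTokens();
    }

    public static void main(String[] args) {
        // Mixed and repeated whitespace is treated as a single separator.
        System.out.println(countWords("three  blind\tmice\r\nsee how\fthey run")); // prints 7
    }
}
```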

But it can miscount in some cases, such as words joined by hyphens:

assertEquals(7, new StringTokenizer("the farmer's wife--she was from Albuquerque").countTokens()); // fails: "wife--she" is one token, so the actual count is 6

In this case, we’d want “wife” and “she” to be different words, but since there’s no whitespace between them, the defaults fail us.

Fortunately, StringTokenizer ships with another constructor. We can pass it a string of delimiter characters, here a space and a hyphen, to make the above work:

assertEquals(7, new StringTokenizer("the farmer's wife--she was from Albuquerque", " -").countTokens());

This comes in handy when trying to count the words in a string from something like a CSV file:

assertEquals(10, new StringTokenizer("did,you,ever,see,such,a,sight,in,your,life", ",").countTokens());
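
One caveat with CSV-like input is that StringTokenizer skips empty fields, because consecutive delimiters never produce an empty token. That suits word counting but not column counting. The sketch below (class and method names are illustrative) contrasts it with String.split using a negative limit, which keeps empty fields:

```java
import java.util.StringTokenizer;

final class EmptyFieldDemo {

    // Counts non-empty tokens; consecutive commas are collapsed.
    static int countTokens(String line) {
        return new StringTokenizer(line, ",").countTokens();
    }

    // Counts every field, including empty ones (a limit of -1 keeps trailing empties).
    static int countFields(String line) {
        return line.split(",", -1).length;
    }

    public static void main(String[] args) {
        String line = "did,,ever";
        System.out.println(countTokens(line)); // prints 2
        System.out.println(countFields(line)); // prints 3
    }
}
```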

So, StringTokenizer is simple, and it gets us most of the way there.

Let’s see, though, what extra power regular expressions can give us.

3. Regular Expressions

To come up with a meaningful regular expression for this task, we first need to define what we consider a word: a word starts with a letter and ends with either a space character or a punctuation mark.

With this in mind, given a string, what we want to do is to split that string at every point we encounter spaces and punctuation marks, then count the resulting words.

assertEquals(7, countWordsUsingRegex("the farmer's wife--she was from Albuquerque"));

Let’s crank things up a bit to see the power of regex:

assertEquals(9, countWordsUsingRegex("no&one#should%ever-write-like,this;but:well"));

It is not practical to solve this one through just passing a delimiter to StringTokenizer since we’d have to define a really long delimiter to try and list out all possible punctuation marks.
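
To make the limitation concrete, here is a sketch of what the delimiter-list approach might look like. The DELIMITERS constant is a hypothetical hand-written list that happens to cover the example string; any punctuation mark left out of it is silently treated as part of a word:

```java
import java.util.StringTokenizer;

final class LongDelimiterDemo {

    // Hypothetical hand-maintained list; it only covers the characters we thought of.
    private static final String DELIMITERS = " &#%-,;:";

    static int countWords(String text) {
        return new StringTokenizer(text, DELIMITERS).countTokens();
    }

    public static void main(String[] args) {
        System.out.println(countWords("no&one#should%ever-write-like,this;but:well")); // prints 9
        // '!' is not in the list, so the two words are counted as one token.
        System.out.println(countWords("hello!world")); // prints 1
    }
}
```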

It turns out we don’t have to do much: passing the regex [\pP\s&&[^']]+ to the split method of the String class will do the trick:

public static int countWordsUsingRegex(String arg) {
    if (arg == null) {
        return 0;
    }
    // Backslashes must be doubled inside a Java string literal.
    final String[] words = arg.split("[\\pP\\s&&[^']]+");
    return words.length;
}

The regex [\pP\s&&[^']]+ matches one or more consecutive punctuation (\pP) or whitespace (\s) characters, while the intersection with [^'] excludes the apostrophe so that it stays part of a word.
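
One edge case of any split-based count is that an input starting with a separator yields a leading empty string, and an empty input yields a single empty string, so the array length overcounts by one. The following is a sketch of one possible way to guard against that, not the tutorial's own implementation. The class name is illustrative, and it filters out empty tokens before counting:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class RegexWordCounter {

    // Same separator class as above: punctuation or whitespace, except the apostrophe.
    private static final Pattern SEPARATORS = Pattern.compile("[\\pP\\s&&[^']]+");

    static long countWords(String text) {
        if (text == null) {
            return 0;
        }
        // Drop the empty token that split produces for leading separators or empty input.
        return Arrays.stream(SEPARATORS.split(text))
                .filter(token -> !token.isEmpty())
                .count();
    }

    public static void main(String[] args) {
        System.out.println(countWords("the farmer's wife--she was from Albuquerque")); // prints 7
        System.out.println(countWords("  hello world")); // prints 2
    }
}
```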

To find out more about regular expressions, refer to Regular Expressions on Baeldung.

4. Loops and the String API

Another approach is to walk through the string character by character, using a flag that tracks whether we are currently inside a word or in a separator.

We set the flag to WORD when encountering a new word and increment the word count, then back to SEPARATOR when we encounter a non-word (punctuation or space characters).

This approach gives us the same results we got with regular expressions:

assertEquals(9, countWordsManually("no&one#should%ever-write-like,this but   well"));

We do have to be careful with special cases where punctuation marks are not really word separators, for example:

assertEquals(7, countWordsManually("the farmer's wife--she was from Albuquerque"));

What we want here is to count “farmer’s” as one word, even though the apostrophe (') is a punctuation mark.

In the regex version, we could express directly in the pattern which punctuation marks should not act as separators. Now that we are writing our own implementation, we have to define this exclusion in a separate method:

private static boolean isAllowedInWord(char charAt) {
    return charAt == '\'' || Character.isLetter(charAt);
}

So what we have done here is to allow a word to contain any letter plus the punctuation marks we consider legal, which in this case is only the apostrophe.
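
A quick way to see what this predicate accepts is to try it on a few characters. The wrapper class below is only an illustrative sketch. Note that Character.isLetter is Unicode-aware, so accented letters pass, while digits do not:

```java
final class AllowedInWordDemo {

    // Same rule as above: letters and the apostrophe may appear inside a word.
    static boolean isAllowedInWord(char c) {
        return c == '\'' || Character.isLetter(c);
    }

    public static void main(String[] args) {
        System.out.println(isAllowedInWord('a'));      // prints true
        System.out.println(isAllowedInWord('\''));     // prints true
        System.out.println(isAllowedInWord('\u00e9')); // e with acute accent, prints true
        System.out.println(isAllowedInWord('-'));      // prints false
        System.out.println(isAllowedInWord('7'));      // prints false, digits are not letters
    }
}
```

If digits should count as word characters in a given use case, Character.isLetterOrDigit would be the natural replacement.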

We can now use this method in our implementation:

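// Flag values; the concrete numbers are an assumption here, they only need to differ.
private static final int WORD = 0;
private static final int SEPARATOR = 1;

public static int countWordsManually(String arg) {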
    if (arg == null) {
        return 0;
    }
    int flag = SEPARATOR;
    int count = 0;
    int stringLength = arg.length();
    int characterCounter = 0;

    while (characterCounter < stringLength) {
        if (isAllowedInWord(arg.charAt(characterCounter)) && flag == SEPARATOR) {
            flag = WORD;
            count++;
        } else if (!isAllowedInWord(arg.charAt(characterCounter))) {
            flag = SEPARATOR;
        }
        characterCounter++;
    }
    return count;
}

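The first condition detects the start of a new word: the current character is allowed in a word and the flag is still SEPARATOR, so we switch the flag to WORD and increment the counter. The second condition checks whether the character is not allowed in a word and, if so, sets the flag back to SEPARATOR.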
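
The same state machine can also be written with a boolean flag and a for loop. The following is a self-contained sketch with illustrative class and method names, not the tutorial's own code, and it behaves the same way on the examples in this section:

```java
final class ManualWordCounter {

    // Letters and the apostrophe may appear inside a word.
    static boolean isAllowedInWord(char c) {
        return c == '\'' || Character.isLetter(c);
    }

    static int countWords(String text) {
        if (text == null) {
            return 0;
        }
        boolean inWord = false; // false plays the role of SEPARATOR, true of WORD
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (isAllowedInWord(text.charAt(i))) {
                if (!inWord) {
                    // Transition from separator to word: a new word starts here.
                    count++;
                    inWord = true;
                }
            } else {
                inWord = false;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("no&one#should%ever-write-like,this but   well")); // prints 9
        System.out.println(countWords("the farmer's wife--she was from Albuquerque"));   // prints 7
    }
}
```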

5. Conclusion

In this tutorial, we have looked at several ways to count the words in a string: StringTokenizer, regular expressions, and a manual loop. We can pick whichever suits our particular use case.

As usual, the source code for this tutorial can be found over on GitHub.
