1. Overview
Strings commonly contain a mixture of words and delimiters. Sometimes, though, a string marks its word boundaries with a change in case rather than with whitespace. For example, camel case capitalizes each word after the first, and title case (or Pascal case) capitalizes every word.
We may wish to parse these strings back into words in order to process them.
In this short tutorial, we’ll look at how to find the words in mixed case strings using regular expressions, and how to convert them into sentences or titles.
2. Use Cases for Parsing Capitalized Strings
A common use case for processing camel case strings might be the field names in a document. Let’s say a document has a field “firstName” – we may wish to display that on-screen as “First name” or “First Name”.
Similarly, if we were to scan the types or functions in our application via reflection, in order to produce reports using their names, we would commonly find camel case or title case identifiers that we may wish to convert.
An extra problem we need to solve when parsing these expressions is that single-letter words cause consecutive capital letters.
For clarity:
- thisIsAnExampleOfCamelCase
- ThisIsTitleCase
- thisHasASingleLetterWord
Now that we know the sorts of identifiers we need to parse, let’s use a regular expression to find the words.
3. Find Words Using Regular Expressions
3.1. Defining a Regular Expression to Find Words
Let’s define a regular expression to locate words that are either made of lowercase letters only, a single uppercase letter followed by lowercase letters, or a single uppercase letter on its own:
Pattern WORD_FINDER = Pattern.compile("(([A-Z]?[a-z]+)|([A-Z]))");
This expression provides the regular expression engine with two options. The first uses “[A-Z]?” to mean “an optional first capital letter” and then “[a-z]+” to mean “one or more lowercase letters”. After that, the “|” character provides “or” logic, followed by the expression “[A-Z]”, which means “a single capital letter”. Note that only the ASCII letters A-Z and a-z count as word characters here, so digits and accented letters match neither option.
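To see which of the two options matched each word, we can inspect the pattern's capture groups. The following is a minimal sketch rather than part of the original example code: the class name WordPatternDemo and the describe helper are hypothetical, and it assumes the same pattern as above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordPatternDemo {
    // Same expression as WORD_FINDER in the article
    static final Pattern WORD_FINDER = Pattern.compile("(([A-Z]?[a-z]+)|([A-Z]))");

    // Hypothetical helper: labels each match with the option that produced it
    static String describe(String text) {
        StringBuilder out = new StringBuilder();
        Matcher matcher = WORD_FINDER.matcher(text);
        while (matcher.find()) {
            // group(2) is the "[A-Z]?[a-z]+" option; group(3) is the lone capital
            String kind = matcher.group(2) != null ? "word" : "single";
            if (out.length() > 0) {
                out.append(", ");
            }
            out.append(matcher.group()).append(":").append(kind);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("thisHasASingleLetterWord"));
        // prints this:word, Has:word, A:single, Single:word, Letter:word, Word:word
    }
}
```

The engine only falls back to the second option when a capital letter is not followed by a lowercase letter, which is exactly the single-letter-word case.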
Now that we have the regular expression, let’s parse our strings.
3.2. Finding Words in a String
We’ll define a method to use this regular expression:
public List<String> findWordsInMixedCase(String text) {
    Matcher matcher = WORD_FINDER.matcher(text);
    List<String> words = new ArrayList<>();
    while (matcher.find()) {
        words.add(matcher.group(0));
    }
    return words;
}
This uses the Matcher created by the regular expression’s Pattern to help us find the words. We iterate over the matcher while it still has matches, adding them to our list.
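On Java 9 or later, the same iteration can be written with Matcher.results(), which exposes the matches as a stream. This is an alternative sketch, not the approach the article uses; the class name StreamWordFinder is hypothetical.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class StreamWordFinder {
    private static final Pattern WORD_FINDER = Pattern.compile("(([A-Z]?[a-z]+)|([A-Z]))");

    // Requires Java 9+: Matcher.results() streams one MatchResult per match
    public static List<String> findWordsInMixedCase(String text) {
        return WORD_FINDER.matcher(text)
          .results()
          .map(MatchResult::group)
          .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(findWordsInMixedCase("thisIsCamelCaseText"));
        // prints [this, Is, Camel, Case, Text]
    }
}
```

The explicit while loop remains the safer choice if the code also has to compile on Java 8.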
This should extract anything that meets our word definition. Let’s test it.
3.3. Testing the Word Finder
Our word finder should be able to find words that are separated by any non-word characters, as well as by changes in case. Let’s start with a simple example:
assertThat(findWordsInMixedCase("some words"))
  .containsExactly("some", "words");
This test passes, showing that the finder splits words on whitespace. Next, we’ll try camel case:
assertThat(findWordsInMixedCase("thisIsCamelCaseText"))
  .containsExactly("this", "Is", "Camel", "Case", "Text");
Here we see that the words are extracted from a camel case String and come out with their capitalization unchanged. For example, “Is” started with a capital letter in the original text, and is capitalized when extracted.
We can also try this with title case:
assertThat(findWordsInMixedCase("ThisIsTitleCaseText"))
  .containsExactly("This", "Is", "Title", "Case", "Text");
Plus, we can check that single-letter words are extracted as we intended:
assertThat(findWordsInMixedCase("thisHasASingleLetterWord"))
  .containsExactly("this", "Has", "A", "Single", "Letter", "Word");
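It's worth knowing how this word definition behaves on input the tests above don't cover. The sketch below reuses the same pattern and method in a hypothetical WordFinderEdgeCases class to show two consequences: every capital in an acronym is treated as a single-letter word, and digits are skipped because they match neither option.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordFinderEdgeCases {
    private static final Pattern WORD_FINDER = Pattern.compile("(([A-Z]?[a-z]+)|([A-Z]))");

    static List<String> findWordsInMixedCase(String text) {
        Matcher matcher = WORD_FINDER.matcher(text);
        List<String> words = new ArrayList<>();
        while (matcher.find()) {
            words.add(matcher.group(0));
        }
        return words;
    }

    public static void main(String[] args) {
        // Consecutive capitals are split into one-letter words
        System.out.println(findWordsInMixedCase("parseXMLDocument"));
        // prints [parse, X, M, L, Document]

        // Digits are not part of any word, so they are dropped
        System.out.println(findWordsInMixedCase("version2Beta"));
        // prints [version, Beta]
    }
}
```

If identifiers with acronyms or numbers need to survive intact, the expression would have to be extended; that is outside what this word definition aims to handle.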
So far, we’ve built a word extractor, but these words are capitalized in a way that may not be ideal for output.
4. Convert Word List to Human Readable Format
After extracting a list of words, we probably want to use methods like toUpperCase or toLowerCase to normalize them. Then we can use String.join to join them back into a single string with a delimiter. Let’s look at a couple of ways to achieve real-world use cases with these.
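As a first illustration of normalizing and joining, here is a minimal sketch that lowercases every word and joins the result with a delimiter. The class NormalizeAndJoin and the helper joinLowerCase are hypothetical names, and passing Locale.ROOT is our own assumption to keep the result independent of the default locale.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class NormalizeAndJoin {
    // Hypothetical helper: lowercases each word, then joins with the delimiter
    static String joinLowerCase(List<String> words, String delimiter) {
        List<String> normalized = new ArrayList<>();
        for (String word : words) {
            // Locale.ROOT is an assumption; the article calls toLowerCase() without a locale
            normalized.add(word.toLowerCase(Locale.ROOT));
        }
        return String.join(delimiter, normalized);
    }

    public static void main(String[] args) {
        System.out.println(joinLowerCase(Arrays.asList("first", "Name"), " "));
        // prints first name
    }
}
```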
4.1. Convert to Sentence
Sentences start with a capital letter and end in a period – “.”. We’re going to need to be able to make a word start with a capital letter:
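private String capitalizeFirst(String word) {
    // Guard against empty input, where substring(0, 1) would throw
    if (word.isEmpty()) {
        return word;
    }
    return word.substring(0, 1).toUpperCase()
      + word.substring(1).toLowerCase();
}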
Then we can loop through the words, capitalizing the first, and making the others lowercase:
public String sentenceCase(List<String> words) {
    List<String> capitalized = new ArrayList<>();
    for (int i = 0; i < words.size(); i++) {
        String currentWord = words.get(i);
        if (i == 0) {
            capitalized.add(capitalizeFirst(currentWord));
        } else {
            capitalized.add(currentWord.toLowerCase());
        }
    }
    return String.join(" ", capitalized) + ".";
}
The logic here is that the first word has its first character capitalized, and the rest are in lowercase. We join them with a space as the delimiter and add a period at the end.
Let’s test this out:
assertThat(sentenceCase(Arrays.asList("these", "Words", "Form", "A", "Sentence")))
  .isEqualTo("These words form a sentence.");
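Putting the two halves together, we can feed the output of the word finder straight into the sentence converter. The following is a self-contained sketch that combines the methods shown so far; the wrapping class FieldNameToSentence is a hypothetical name, and the methods are made static only so the example runs from main.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldNameToSentence {
    private static final Pattern WORD_FINDER = Pattern.compile("(([A-Z]?[a-z]+)|([A-Z]))");

    static List<String> findWordsInMixedCase(String text) {
        Matcher matcher = WORD_FINDER.matcher(text);
        List<String> words = new ArrayList<>();
        while (matcher.find()) {
            words.add(matcher.group(0));
        }
        return words;
    }

    static String capitalizeFirst(String word) {
        return word.substring(0, 1).toUpperCase()
          + word.substring(1).toLowerCase();
    }

    static String sentenceCase(List<String> words) {
        List<String> capitalized = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String currentWord = words.get(i);
            // Only the first word keeps a capital letter
            capitalized.add(i == 0 ? capitalizeFirst(currentWord) : currentWord.toLowerCase());
        }
        return String.join(" ", capitalized) + ".";
    }

    public static void main(String[] args) {
        System.out.println(sentenceCase(findWordsInMixedCase("thisHasASingleLetterWord")));
        // prints This has a single letter word.
    }
}
```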
4.2. Convert to Title Case
Title case has slightly more complex rules than a sentence. Each word must have a capital letter, unless it’s a special stop word that isn’t normally capitalized. However, the whole title must start with a capital letter.
We can achieve this by defining our stop words:
Set<String> STOP_WORDS = Stream.of("a", "an", "the", "and",
  "but", "for", "at", "by", "to", "or")
  .collect(Collectors.toSet());
After this, we can modify the if statement in our loop to capitalize any word that’s not a stop word, as well as the first:
if (i == 0 ||
  !STOP_WORDS.contains(currentWord.toLowerCase())) {
    capitalized.add(capitalizeFirst(currentWord));
}
The algorithm to combine the words is the same, though we don’t add the period at the end. We’ll name this variant capitalizeMyTitle.
Let’s test it out:
assertThat(capitalizeMyTitle(Arrays.asList("title", "words", "capitalize")))
  .isEqualTo("Title Words Capitalize");

assertThat(capitalizeMyTitle(Arrays.asList("a", "stop", "word", "first")))
  .isEqualTo("A Stop Word First");
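Since only the changed if statement was shown, here is one way the complete method might look. This is a sketch under assumptions: the else branch, which lowercases stop words, is carried over from sentenceCase and is not shown in the fragment above, and the wrapping class TitleCaseDemo is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TitleCaseDemo {
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
      "a", "an", "the", "and", "but", "for", "at", "by", "to", "or"));

    static String capitalizeFirst(String word) {
        return word.substring(0, 1).toUpperCase()
          + word.substring(1).toLowerCase();
    }

    static String capitalizeMyTitle(List<String> words) {
        List<String> capitalized = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String currentWord = words.get(i);
            if (i == 0 || !STOP_WORDS.contains(currentWord.toLowerCase())) {
                capitalized.add(capitalizeFirst(currentWord));
            } else {
                // Assumed branch: stop words after the first word stay lowercase
                capitalized.add(currentWord.toLowerCase());
            }
        }
        // No trailing period for a title
        return String.join(" ", capitalized);
    }

    public static void main(String[] args) {
        System.out.println(capitalizeMyTitle(Arrays.asList("war", "and", "peace")));
        // prints War and Peace
    }
}
```

The last example shows a stop word in the middle of a title, a case the two assertions above don't exercise.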
5. Conclusion
In this short article, we looked at how to find the words in a String using a regular expression. We saw how to define this regular expression to find different words using capitalization as a word boundary.
We also looked at some simple algorithms for taking a list of words and converting them into the correct capitalization for a sentence or a title.
As always, the example code can be found over on GitHub.