Guide to StreamTokenizer

Last modified: August 1, 2019

1. Introduction

In this tutorial, we’ll show how to parse a stream of characters into tokens using the Java StreamTokenizer class.

2. StreamTokenizer

The StreamTokenizer class reads the stream character by character. Each character can have zero or more of the following attributes: whitespace, alphabetic, numeric, string quote, or comment character.

Now, we need to understand the default configuration. We have the following types of characters:

  • Word characters: ‘a’ to ‘z’, ‘A’ to ‘Z’, and the byte values 160 to 255
  • Numeric characters: the digits 0 to 9, plus ‘.’ and ‘-’
  • Whitespace characters: ASCII values from 0 to 32
  • Comment character: /
  • String quote characters: the single quote (') and the double quote (")

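Note that line ends are treated as whitespace, not as separate tokens, and that C/C++-style comments (// and /* */) are not recognized by default. Because ‘/’ is itself a single-line comment character, however, everything from the first ‘/’ to the end of the line is still skipped.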
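
To make these defaults concrete, here is a minimal sketch; the class name DefaultSyntaxDemo and its tokenize helper are hypothetical names introduced for illustration. With cppComments set to false it uses the default configuration. With true it first makes ‘/’ an ordinary character and then enables C++-style comments through slashSlashComments(true), so that a single ‘/’ survives as a token:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original tutorial code.
class DefaultSyntaxDemo {

    // Collects every token as a string so the result is easy to compare.
    static List<String> tokenize(String input, boolean cppComments) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        if (cppComments) {
            st.ordinaryChar('/');        // '/' alone is no longer a comment character
            st.slashSlashComments(true); // but "//" still starts a comment
        }
        List<String> tokens = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_NUMBER) {
                tokens.add(String.valueOf(st.nval));
            } else if (st.ttype == StreamTokenizer.TT_WORD || st.ttype == '\'' || st.ttype == '"') {
                tokens.add(st.sval);
            } else {
                tokens.add(String.valueOf((char) st.ttype));
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // Line breaks are plain whitespace; '/' hides the rest of its line.
        System.out.println(tokenize("one\ntwo / ignored\nthree", false)); // [one, two, three]
        // With C++-style comments enabled, only "// c" is skipped.
        System.out.println(tokenize("a / b // c", true));                 // [a, /, b]
    }
}
```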

The class defines a set of important constants:

  • TT_EOF – A constant indicating the end of the stream
  • TT_EOL – A constant indicating the end of the line
  • TT_NUMBER – A constant indicating a number token
  • TT_WORD – A constant indicating a word token

3. Default Configuration

Here, we’re going to create an example in order to understand the StreamTokenizer mechanism. We’ll start by creating an instance of this class and then call the nextToken() method until it returns the TT_EOF value:

import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.util.ArrayList;
import java.util.List;

private static final int QUOTE_CHARACTER = '\'';
private static final int DOUBLE_QUOTE_CHARACTER = '"';

public static List<Object> streamTokenizerWithDefaultConfiguration(Reader reader) throws IOException {
    StreamTokenizer streamTokenizer = new StreamTokenizer(reader);
    List<Object> tokens = new ArrayList<>();

    int currentToken = streamTokenizer.nextToken();
    while (currentToken != StreamTokenizer.TT_EOF) {

        if (streamTokenizer.ttype == StreamTokenizer.TT_NUMBER) {
            // number token: the parsed value is in nval (a double)
            tokens.add(streamTokenizer.nval);
        } else if (streamTokenizer.ttype == StreamTokenizer.TT_WORD
            || streamTokenizer.ttype == QUOTE_CHARACTER
            || streamTokenizer.ttype == DOUBLE_QUOTE_CHARACTER) {
            // word or quoted string: the text is in sval
            tokens.add(streamTokenizer.sval);
        } else {
            // ordinary character: the token type is the character itself
            tokens.add((char) currentToken);
        }

        currentToken = streamTokenizer.nextToken();
    }

    return tokens;
}

The test file simply contains:

3 quick brown foxes jump over the "lazy" dog!
#test1
//test2

Now, if we printed out the contents of the list, labeling each token by its type, we’d see:

Number: 3.0
Word: quick
Word: brown
Word: foxes
Word: jump
Word: over
Word: the
Word: lazy
Word: dog
Ordinary char: !
Ordinary char: #
Word: test1
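
The method shown earlier returns raw values without labels. As a rough sketch of how a labeled listing like the one above could be produced, the following derives each label directly from ttype; the class LabeledTokens and the method labelTokens are hypothetical names, and the input string mirrors the test file:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original tutorial code.
class LabeledTokens {

    // Reads every token and prefixes it with a label derived from ttype.
    static List<String> labelTokens(Reader reader) throws IOException {
        StreamTokenizer st = new StreamTokenizer(reader);
        List<String> lines = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            switch (st.ttype) {
                case StreamTokenizer.TT_NUMBER:
                    lines.add("Number: " + st.nval);
                    break;
                case StreamTokenizer.TT_WORD:
                case '\'':
                case '"':
                    // words and quoted strings both expose their text in sval
                    lines.add("Word: " + st.sval);
                    break;
                default:
                    // an ordinary character is its own token type
                    lines.add("Ordinary char: " + (char) st.ttype);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        String input = "3 quick brown foxes jump over the \"lazy\" dog!\n#test1\n//test2\n";
        labelTokens(new StringReader(input)).forEach(System.out::println);
    }
}
```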

In order to better understand the example, we need to explain the StreamTokenizer.ttype, StreamTokenizer.nval and StreamTokenizer.sval fields.

The ttype field contains the type of the token just read. It can be TT_EOF, TT_EOL, TT_NUMBER, or TT_WORD. For a quoted string token, however, its value is the ASCII value of the quote character. Moreover, if the token is an ordinary character with no attributes, like ‘!’, then ttype holds the ASCII value of that character.

Next, we read the sval field only when the token is a TT_WORD, that is, a word token, or a quoted string. For a quoted string token such as “lazy”, sval contains the body of the string without the surrounding quotes.

Last, we read the nval field only when the token is a number token, that is, when ttype is TT_NUMBER. Since nval is a double, the input 3 is reported as 3.0.
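
The interplay of the three fields can be observed with a small sketch like the following. The class TokenFieldsDemo and its describe method are hypothetical names; the numeric values in the comment are the actual constants (TT_NUMBER is -2, TT_WORD is -3) and ASCII codes (34 for the double quote, 33 for ‘!’):

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original tutorial code.
class TokenFieldsDemo {

    // Records ttype for every token, plus whichever value field is meaningful for it.
    static List<String> describe(String input) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        List<String> out = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_NUMBER) {
                out.add("ttype=" + st.ttype + " nval=" + st.nval);
            } else if (st.ttype == StreamTokenizer.TT_WORD || st.ttype == '"' || st.ttype == '\'') {
                out.add("ttype=" + st.ttype + " sval=" + st.sval);
            } else {
                out.add("ttype=" + st.ttype); // ordinary character: only ttype is set
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // prints: [ttype=-2 nval=42.0, ttype=-3 sval=fox, ttype=34 sval=lazy, ttype=33]
        System.out.println(describe("42 fox \"lazy\" !"));
    }
}
```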

4. Custom Configuration

Here, we’ll change the default configuration and create another example.

First, we’re going to set some extra word characters using the wordChars(int low, int hi) method. Then, we’ll make the comment character (‘/’) an ordinary one and promote ‘#’ to be the new comment character.

Finally, we’ll consider the end of the line as a token character with the help of the eolIsSignificant(boolean flag) method.

We only need to call these methods on the streamTokenizer object:

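public static List<Object> streamTokenizerWithCustomConfiguration(Reader reader) throws IOException {
    StreamTokenizer streamTokenizer = new StreamTokenizer(reader);
    List<Object> tokens = new ArrayList<>();

    streamTokenizer.wordChars('!', '-');     // '!' to '-' (ASCII 33 to 45) become word characters
    streamTokenizer.ordinaryChar('/');       // '/' no longer starts a comment
    streamTokenizer.commentChar('#');        // '#' now starts a single-line comment
    streamTokenizer.eolIsSignificant(true);  // line ends are returned as TT_EOL tokens

    // same token loop as in streamTokenizerWithDefaultConfiguration

    return tokens;
}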

And here we have a new output:

// same output as earlier
Word: "lazy"
Word: dog!
Ordinary char: 

Ordinary char: 

Ordinary char: /
Ordinary char: /
Word: test2

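Note that the double quotes became part of the token, because the range ‘!’ to ‘-’ includes the double quote. The line end is no longer skipped as whitespace either: it is returned as a separate TT_EOL token. Our loop has no branch for TT_EOL, whose value is the newline character, so it is printed as an ordinary single-character token.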

Also, the characters following the ‘#’ character are now skipped and the ‘/’ is an ordinary character.
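
For reference, a self-contained sketch of the custom configuration with the token loop written out could look like this. The class name CustomConfigurationDemo is hypothetical, and the input is assumed to be the test file content without a trailing line break:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original tutorial code.
class CustomConfigurationDemo {

    static List<Object> tokenize(Reader reader) throws IOException {
        StreamTokenizer st = new StreamTokenizer(reader);
        st.wordChars('!', '-');    // includes '!', '"', '#' and the single quote
        st.ordinaryChar('/');      // '/' becomes a plain single-character token
        st.commentChar('#');       // overrides the word attribute that '#' just received
        st.eolIsSignificant(true); // line ends are reported as TT_EOL

        List<Object> tokens = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_NUMBER) {
                tokens.add(st.nval);
            } else if (st.ttype == StreamTokenizer.TT_WORD) {
                tokens.add(st.sval);
            } else {
                // ordinary characters and TT_EOL (whose value is '\n') end up here
                tokens.add((char) st.ttype);
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        String input = "3 quick brown foxes jump over the \"lazy\" dog!\n#test1\n//test2";
        System.out.println(tokenize(new StringReader(input)));
    }
}
```

The order of the configuration calls matters in this sketch: commentChar('#') comes after wordChars('!', '-'), so ‘#’ ends up as a comment character rather than a word character.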

We could also register a different quote character with the quoteChar(int ch) method, or even change the whitespace characters with the whitespaceChars(int low, int hi) method. Further customizations are possible by calling StreamTokenizer’s methods in different combinations.
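
As an illustration of such a combination, the sketch below treats ‘,’ and ‘=’ as whitespace and ‘|’ as a quote character. The class KeyValueTokenizerDemo and the input format are invented for this example only:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original tutorial code.
class KeyValueTokenizerDemo {

    static List<Object> tokenize(String input) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        st.whitespaceChars(',', ','); // commas now separate tokens
        st.whitespaceChars('=', '='); // so do equals signs
        st.quoteChar('|');            // text between two '|' is one string token

        List<Object> tokens = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_NUMBER) {
                tokens.add(st.nval);
            } else if (st.ttype == StreamTokenizer.TT_WORD || st.ttype == '|') {
                tokens.add(st.sval);
            } else {
                tokens.add((char) st.ttype);
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // prints: [name, John Smith, age, 42.0]
        System.out.println(tokenize("name=|John Smith|,age=42"));
    }
}
```

Note that quoteChar adds ‘|’ as a quote character; the default quote characters remain active unless they are reconfigured as well.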

5. Conclusion

In this tutorial, we’ve seen how to parse a stream of characters into tokens using the StreamTokenizer class. We’ve learned about the default mechanism and created an example with the default configuration.

Finally, we’ve changed the default parameters and we’ve noticed how flexible the StreamTokenizer class is.

As usual, the code can be found over on GitHub.
