Normalize a URL in Java

Last modified: February 5, 2024

1. Introduction

Uniform Resource Locators (URLs) are a significant part of web development, as they let us locate and retrieve resources on the Internet. However, URLs that point to the same resource can be written inconsistently or formatted incorrectly, which can cause problems when we process them or fetch the resource.

URL normalization transforms a URL into a canonical form, so that equivalent URLs are represented consistently and are easier to compare and process.

Throughout this tutorial, we’ll investigate different techniques to normalize a URL in Java.

2. Manual Normalization

Performing manual normalization involves applying custom logic to standardize the URLs. This process includes removing extraneous elements, such as unnecessary query parameters and fragment identifiers, to distill the URL down to its essential core. Suppose we have the following URL:

https://www.example.com:8080/path/to/resource?param1=value1&param2=value2#fragment

The normalized URL should be as follows:

https://www.example.com:8080/path/to/resource

Note that we’re considering anything after “?” as unnecessary, as we’re only interested in grouping by resource. But that’ll vary depending on the use case.
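
The manual approach described above can be sketched in plain Java. This is only a minimal illustration, not a complete normalizer: the class name ManualUrlNormalizer and the helper stripQueryAndFragment() are hypothetical names introduced here, and the sketch assumes that everything from the first "?" or "#" onward can be discarded.

```java
class ManualUrlNormalizer {

    // Hypothetical helper: cuts the URL at the first '?' or '#', whichever comes first.
    static String stripQueryAndFragment(String url) {
        int queryStart = url.indexOf('?');
        int fragmentStart = url.indexOf('#');

        int end = url.length();
        if (queryStart >= 0) {
            end = Math.min(end, queryStart);
        }
        if (fragmentStart >= 0) {
            end = Math.min(end, fragmentStart);
        }
        return url.substring(0, end);
    }

    public static void main(String[] args) {
        String originalUrl = "https://www.example.com:8080/path/to/resource?param1=value1&param2=value2#fragment";
        // prints https://www.example.com:8080/path/to/resource
        System.out.println(stripQueryAndFragment(originalUrl));
    }
}
```

Checking for "#" as well as "?" means a URL that carries a fragment but no query string is also reduced to its resource part.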

3. Utilizing Apache Commons Validator

The UrlValidator class in the Apache Commons Validator library is a convenient way to validate URLs. It doesn't normalize them itself, so we'll combine it with a manual normalization step. First, we should ensure that our project includes the Apache Commons Validator dependency as follows:

<dependency>
    <groupId>commons-validator</groupId>
    <artifactId>commons-validator</artifactId>
    <version>1.8.0</version>
    <scope>test</scope>
</dependency>

Now, we’re ready to implement a simple Java code example:

String originalUrl = "https://www.example.com:8080/path/to/resource?param1=value1&param2=value2#fragment";
String expectedNormalizedUrl = "https://www.example.com:8080/path/to/resource";

@Test
public void givenOriginalUrl_whenUsingApacheCommonsValidator_thenValidatedAndMaybeManuallyNormalized() {
    UrlValidator urlValidator = new UrlValidator();
    if (urlValidator.isValid(originalUrl)) {
        String normalizedUrl = originalUrl.split("\\?")[0];
        assertEquals(expectedNormalizedUrl, normalizedUrl);
    } else {
        fail(originalUrl);
    }
}

Here, we start by instantiating a UrlValidator. Then, we use the isValid() method to determine whether the original URL complies with the validator's default rules, which accept the http, https, and ftp schemes.

If the URL is valid, we normalize it by hand, removing the query parameters and the fragment, that is, everything from ‘?’ onward. Finally, we use the assertEquals() method to check that normalizedUrl equals expectedNormalizedUrl.

4. Utilizing Java’s URI Class

The URI class in the java.net package provides further features for handling URIs, including normalization. Let’s see a simple example:

@Test
public void givenOriginalUrl_whenUsingJavaURIClass_thenNormalizedUrl() throws URISyntaxException {
    URI uri = new URI(originalUrl);
    URI normalizedUri = new URI(uri.getScheme(), uri.getAuthority(), uri.getPath(), null, null);
    String normalizedUrl = normalizedUri.toString();
    assertEquals(expectedNormalizedUrl, normalizedUrl);
}

Within this test, we pass the originalUrl to the URI object, and a normalized URI is derived by extracting and reassembling specific components such as scheme, authority, and path.
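
The URI class also has a normalize() method, which cleans up the path by removing "." segments and resolving ".." segments. As a rough sketch, it can be combined with the reassembly step above. The class name UriPathNormalizer and the sample URL are assumptions made for illustration, and the sketch assumes a hierarchical URL that has an authority component.

```java
import java.net.URI;

class UriPathNormalizer {

    // Hypothetical helper: normalizes the path, then keeps only scheme, authority, and path.
    static String normalize(String url) {
        // normalize() only rewrites the path; scheme, host, and port are left as they are
        URI uri = URI.create(url).normalize();
        return uri.getScheme() + "://" + uri.getRawAuthority() + uri.getRawPath();
    }

    public static void main(String[] args) {
        String messyUrl = "https://www.example.com:8080/path/./to/../to/resource?param1=value1#fragment";
        // prints https://www.example.com:8080/path/to/resource
        System.out.println(normalize(messyUrl));
    }
}
```

URI.create() throws an unchecked IllegalArgumentException for malformed input, which keeps the sketch short. Code that handles untrusted URLs may prefer the URI constructor and its checked URISyntaxException, as in the test above.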

5. Using Regular Expressions

Regular expressions are another useful mechanism for URL normalization in Java. They let us specify patterns that match URLs and transform them based on our needs. Here’s a simple code example:

@Test
public void givenOriginalUrl_whenUsingRegularExpression_thenNormalizedUrl() {
    String regex = "^(https?://[^/]+/[^?#]+)";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(originalUrl);

    if (matcher.find()) {
        String normalizedUrl = matcher.group(1);
        assertEquals(expectedNormalizedUrl, normalizedUrl);
    } else {
        fail(originalUrl);
    }
}

In the above code example, we first define a regex that matches the scheme, authority (host and port), and path components of the URL. Then, we compile it into a Pattern object and use a Matcher to match the original URL against that pattern.

Next, we call matcher.find() to look for the next subsequence of the input that matches the pattern. If it returns true, matcher.group(1) returns the content of the first capturing group in the regex (denoted by parentheses), which is our normalized URL.
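
The pattern in the test requires at least one path character after the host, so a URL such as https://www.example.com?x=1 would not match it. A slightly more permissive variant is sketched below as one possible adjustment. The class name RegexUrlNormalizer and the adjusted pattern are assumptions for illustration, not part of the original test.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexUrlNormalizer {

    // Scheme, then an authority without '/', '?', or '#', then an optional path up to '?' or '#'
    private static final Pattern URL_PREFIX = Pattern.compile("^(https?://[^/?#]+[^?#]*)");

    // Hypothetical helper: returns the URL without query and fragment, or empty if it doesn't match.
    static Optional<String> normalize(String url) {
        Matcher matcher = URL_PREFIX.matcher(url);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // prints Optional[https://www.example.com:8080/path/to/resource]
        System.out.println(normalize("https://www.example.com:8080/path/to/resource?param1=value1&param2=value2#fragment"));
        // prints Optional[https://www.example.com]
        System.out.println(normalize("https://www.example.com?x=1"));
    }
}
```

Compiling the Pattern once as a constant avoids recompiling it on every call, and returning an Optional makes the non-matching case explicit to the caller.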

6. Conclusion

In this article, we explored several ways to normalize a URL in Java: manual normalization, the Apache Commons Validator library, Java’s URI class, and regular expressions.

As usual, the accompanying source code can be found over on GitHub.
