Introduction to HtmlUnit

Last modified: December 10, 2016

1. Introduction

In this article, we will introduce HtmlUnit, a tool that allows us to, simply put, interact with and test an HTML site programmatically, using Java APIs.

2. About HtmlUnit

HtmlUnit is a GUI-less browser – a browser intended to be used programmatically and not directly by a user.

The browser supports JavaScript (via the Mozilla Rhino engine) and can be used even for websites with complex AJAX functionality. All of this is done while simulating a typical GUI-based browser such as Chrome or Firefox.

The name HtmlUnit could lead you to think that it’s a testing framework, but while it can definitely be used for testing, it can do so much more than that.

It has also been integrated into Spring 4 and can be used seamlessly together with the Spring MVC Test framework.

3. Download and Maven Dependency

HtmlUnit can be downloaded from SourceForge or from the official website. You can also add it to your project through a build tool such as Maven or Gradle. For instance, this is the Maven dependency you can currently include in your project:

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.23</version>
</dependency>

The newest version can be found on Maven Central.

4. Web Testing

There are many ways in which you can test a web application – most of which we covered here on the site at one point or another.

With HtmlUnit you can directly parse the HTML of a site, interact with it just as a normal user would from the browser, check JavaScript and CSS syntax, submit forms and parse the responses to see the content of their HTML elements. All of this is done in pure Java code.

Let’s start with a simple test: create a WebClient and fetch the home page of www.baeldung.com:

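import org.junit.After;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// The class name is illustrative; the original snippet shows only the members.
public class HtmlUnitWebTestingLiveTest {

    private WebClient webClient;

    @Before
    public void init() throws Exception {
        webClient = new WebClient();
    }

    @After
    public void close() throws Exception {
        // Release the resources held by the simulated browser.
        webClient.close();
    }

    @Test
    public void givenAClient_whenEnteringBaeldung_thenPageTitleIsOk()
      throws Exception {
        // getPage needs an absolute URL; a bare "/" cannot be resolved.
        HtmlPage page = webClient.getPage("https://www.baeldung.com/");

        Assert.assertEquals(
          "Baeldung | Java, Spring and Web Development tutorials",
          page.getTitleText());
    }
}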

You may see warnings or errors when running this test if the website has JavaScript or CSS problems. If the site is yours, you should fix them.

Sometimes, if you know what you’re doing (for instance, if the only errors come from third-party JavaScript libraries that you should not modify), you can prevent these errors from making your test fail by calling setThrowExceptionOnScriptError with false:

@Test
public void givenAClient_whenEnteringBaeldung_thenPageTitleIsCorrect()
  throws Exception {
    // Script errors are tolerated instead of failing the test.
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    // getPage needs an absolute URL; a bare "/" cannot be resolved.
    HtmlPage page = webClient.getPage("https://www.baeldung.com/");

    Assert.assertEquals(
      "Baeldung | Java, Spring and Web Development tutorials",
      page.getTitleText());
}

5. Web Scraping

You don’t need to use HtmlUnit just for your own websites. It’s a browser, after all: you can use it to navigate any website you like, sending and retrieving data as needed.

Fetching, parsing, storing and analyzing data from websites is the process known as web scraping, and HtmlUnit can help you with the fetching and parsing parts.

The previous example shows how we can enter any website and navigate through it, retrieving all the info we want.

For instance, let’s go to Baeldung’s full archive of articles, navigate to the latest article and retrieve its title (first <h1> tag). For our test, that will be enough; but, if we wanted to store more info, we could, for instance, retrieve the headings (all <h2> tags) as well, thus having a basic idea of what the article is about.

It’s easy to get elements by their ID, but generally, if you need to find an element it’s more convenient to use XPath syntax. HtmlUnit allows us to use it, so we will.

@Test
public void givenBaeldungArchive_whenRetrievingArticle_thenHasH1()
  throws Exception {
    // Only the HTML structure matters here, so skip CSS and JavaScript.
    webClient.getOptions().setCssEnabled(false);
    webClient.getOptions().setJavaScriptEnabled(false);

    // getPage needs an absolute URL; a bare path cannot be resolved.
    String url = "https://www.baeldung.com/full_archive";
    HtmlPage page = webClient.getPage(url);

    // First list item across all monthly listings, i.e. the latest post.
    String xpath = "(//ul[@class='car-monthlisting']/li)[1]/a";
    HtmlAnchor latestPostLink
      = (HtmlAnchor) page.getByXPath(xpath).get(0);
    HtmlPage postPage = latestPostLink.click();

    List<HtmlHeading1> h1
      = (List<HtmlHeading1>) postPage.getByXPath("//h1");

    Assert.assertTrue(h1.size() > 0);
}

Note that in this case we are interested in neither CSS nor JavaScript and just want to parse the HTML layout, so we turned both off.
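To see why the XPath expression in the test wraps the list items in parentheses, here is a small sketch that runs outside HtmlUnit, using the JDK's own javax.xml.xpath API. The markup in SAMPLE_ARCHIVE, the class name and the helper methods are assumptions made up for illustration; the real archive page will look different.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathFirstLinkDemo {

    // Hypothetical, well-formed stand-in for the archive page markup.
    static final String SAMPLE_ARCHIVE =
        "<html><body>"
      + "<ul class='car-monthlisting'>"
      + "<li><a href='/newest-post'>Newest post</a></li>"
      + "<li><a href='/older-post'>Older post</a></li>"
      + "</ul>"
      + "<ul class='car-monthlisting'>"
      + "<li><a href='/oldest-post'>Oldest post</a></li>"
      + "</ul>"
      + "</body></html>";

    // Returns the href of the link selected by the expression used in the test.
    static String firstLinkHref(String xhtml) {
        String expression = "(//ul[@class='car-monthlisting']/li)[1]/a";
        try {
            Element anchor = (Element) XPathFactory.newInstance().newXPath().evaluate(
                expression, new InputSource(new StringReader(xhtml)), XPathConstants.NODE);
            return anchor == null ? null : anchor.getAttribute("href");
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    // Counts how many nodes an arbitrary expression selects in the markup.
    static int countMatches(String expression, String xhtml) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                expression, new InputSource(new StringReader(xhtml)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstLinkHref(SAMPLE_ARCHIVE)); // prints "/newest-post"
        // With parentheses: one link. Without: the first link of every list.
        System.out.println(countMatches(
            "(//ul[@class='car-monthlisting']/li)[1]/a", SAMPLE_ARCHIVE)); // prints 1
        System.out.println(countMatches(
            "//ul[@class='car-monthlisting']/li[1]/a", SAMPLE_ARCHIVE)); // prints 2
    }
}
```

With the parentheses, [1] applies to the combined sequence of list items, so only the newest link is selected. Without them, li[1] matches the first item of every monthly list.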

In a real web scraping scenario, you could extract, for example, the h1 and h2 titles, and the outcome would be something like this:

Java Web Weekly, Issue 135
1. Spring and Java
2. Technical and Musings
3. Comics
4. Pick of the Week

You can check that the retrieved info indeed corresponds to the latest article on Baeldung:

[Image: latestBaeldung]
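As a rough sketch of how such an outline of h1 and h2 titles could be collected, the following example applies the "//h1" and "//h2" expressions to a small, made-up page using the JDK's javax.xml.xpath API. SAMPLE_POST, the class name and the helper methods are assumptions for illustration only; with HtmlUnit, the same expressions would be passed to getByXPath on the retrieved page.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class HeadingOutlineDemo {

    // Hypothetical, well-formed stand-in for the article page markup.
    static final String SAMPLE_POST =
        "<html><body>"
      + "<h1>Java Web Weekly, Issue 135</h1>"
      + "<h2>1. Spring and Java</h2><p>...</p>"
      + "<h2>2. Technical and Musings</h2><p>...</p>"
      + "<h2>3. Comics</h2><p>...</p>"
      + "<h2>4. Pick of the Week</h2><p>...</p>"
      + "</body></html>";

    // Returns the trimmed text of every node selected by the expression.
    static List<String> textsOf(String expression, String xhtml) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                expression, new InputSource(new StringReader(xhtml)), XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent().trim());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    // Builds the outline: the h1 title(s) first, then all h2 headings.
    static List<String> outline(String xhtml) {
        List<String> lines = new ArrayList<>(textsOf("//h1", xhtml));
        lines.addAll(textsOf("//h2", xhtml));
        return lines;
    }

    public static void main(String[] args) {
        for (String line : outline(SAMPLE_POST)) {
            System.out.println(line);
        }
    }
}
```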

6. What About AJAX?

AJAX functionality can be a problem because HtmlUnit will usually retrieve the page before the AJAX calls have finished. Often you need those calls to finish in order to test your website properly or to retrieve the data you want. There are several ways to deal with this:

  • You can use webClient.setAjaxController(new NicelyResynchronizingAjaxController()). This controller resynchronizes AJAX calls made from the main thread: they are performed synchronously, so that there is a stable state to test.
  • When entering a page of a web application, you can wait a few seconds so that AJAX calls have enough time to finish. To achieve this, you can use webClient.waitForBackgroundJavaScript(MILLIS) or webClient.waitForBackgroundJavaScriptStartingBefore(MILLIS). Call them after retrieving the page, but before working with it.
  • You can wait until some expected condition related to the execution of the AJAX call is met. For instance:
for (int i = 0; i < 20; i++) {
    if (condition_to_happen_after_js_execution) {
        break;
    }
    synchronized (page) {
        page.wait(500);
    }
}
  • Instead of creating a new WebClient(), which defaults to the best-supported web browser, try other browsers, since they might work better with your JavaScript or AJAX calls. For instance, this creates a WebClient that simulates Chrome:
WebClient webClient = new WebClient(BrowserVersion.CHROME);
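The "wait until a condition is met" approach from the list above can be wrapped in a small reusable helper. This is a minimal sketch using only the JDK; the class name PollingWait and the method waitUntil are made up for illustration and are not part of HtmlUnit. It uses Thread.sleep rather than page.wait, and it checks the condition one last time after the final pause.

```java
import java.util.function.BooleanSupplier;

final class PollingWait {

    /**
     * Checks the condition up to maxAttempts times, pausing pauseMillis
     * after each failed check, then checks once more after the last pause.
     * Returns true as soon as the condition holds, false otherwise.
     */
    static boolean waitUntil(BooleanSupplier condition, int maxAttempts, long pauseMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt status and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return condition.getAsBoolean();
    }

    public static void main(String[] args) {
        // Stand-in condition: becomes true on the third check.
        int[] checks = {0};
        boolean met = waitUntil(() -> ++checks[0] >= 3, 20, 500);
        System.out.println(met + " after " + checks[0] + " checks"); // prints "true after 3 checks"
    }
}
```

In an HtmlUnit test, the condition would typically inspect the page, for example by checking that an element the AJAX call is expected to add (say, one with a hypothetical ID "result") is present.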

7. An Example With Spring

If we’re testing our own Spring application, then things get a little bit easier – we no longer need a running server.

Let’s implement a very simple example app: just a controller with a method that receives some text, and a single HTML page with a form. The user can type text into the form and submit it, and the text will be shown below the form.

In this case, we’ll use a Thymeleaf template for that HTML page. The test for it looks like this:

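import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;
import org.springframework.test.context.web.WebAppConfiguration;
import org.springframework.test.web.servlet.htmlunit.MockMvcWebClientBuilder;
import org.springframework.web.context.WebApplicationContext;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

@RunWith(SpringJUnit4ClassRunner.class)
@WebAppConfiguration
@ContextConfiguration(classes = { TestConfig.class })
public class HtmlUnitAndSpringTest {

    @Autowired
    private WebApplicationContext wac;

    private WebClient webClient;

    @Before
    public void setup() {
        // Requests go through MockMvc, so no running server is needed.
        webClient = MockMvcWebClientBuilder
          .webAppContextSetup(wac).build();
    }

    @Test
    public void givenAMessage_whenSent_thenItShows() throws Exception {
        String text = "Hello world!";
        HtmlPage page;

        String url = "http://localhost/message/showForm";
        page = webClient.getPage(url);

        // Fill in the text field of the form.
        HtmlTextInput messageText = page.getHtmlElementById("message");
        messageText.setValueAttribute(text);

        // Submit the form and keep the page that comes back.
        HtmlForm form = page.getForms().get(0);
        HtmlSubmitInput submit = form.getOneHtmlElementByAttribute(
          "input", "type", "submit");
        HtmlPage newPage = submit.click();

        String receivedText = newPage.getHtmlElementById("received")
            .getTextContent();

        // JUnit expects the expected value first, then the actual one.
        Assert.assertEquals(text, receivedText);
    }
}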

The key here is building the WebClient object using MockMvcWebClientBuilder from the WebApplicationContext. With the WebClient, we can get the first page of the navigation (notice how it’s served by localhost), and start browsing from there.

As you can see, the test parses the form, enters a message (in a field with ID “message”), submits the form and, on the new page, asserts that the received text (field with ID “received”) is the same as the text we submitted.

8. Conclusion

HtmlUnit is a great tool that allows you to test your web applications easily, filling in form fields and submitting them just as if you were using the site in a browser.

It integrates seamlessly with Spring 4, and together with the Spring MVC Test framework it gives you a very powerful environment for writing integration tests of all your pages, even without a web server.

Also, using HtmlUnit you can automate any task related to web browsing, such as fetching, parsing, storing and analyzing data (web scraping).

You can get the code over on GitHub.