1. Introduction
We see web crawlers in use every time we use our favorite search engine. They’re also commonly used to scrape and analyze data from websites.
In this tutorial, we’re going to learn how to use crawler4j to set up and run our own web crawlers. crawler4j is an open source Java project that allows us to do this easily.
2. Setup
Let’s use Maven Central to find the most recent version and bring in the Maven dependency:
<dependency>
    <groupId>edu.uci.ics</groupId>
    <artifactId>crawler4j</artifactId>
    <version>4.4.0</version>
</dependency>
3. Creating Crawlers
3.1. Simple HTML Crawler
We’re going to start by creating a basic crawler that crawls the HTML pages on https://baeldung.com.
Let’s create our crawler by extending WebCrawler in our crawler class and defining a pattern to exclude certain file types:
public class HtmlCrawler extends WebCrawler {

    private final static Pattern EXCLUSIONS
      = Pattern.compile(".*(\\.(css|js|xml|gif|jpg|png|mp3|mp4|zip|gz|pdf))$");

    // more code
}
In each crawler class, we must override and implement two methods: shouldVisit and visit.
Let’s create our shouldVisit method now using the EXCLUSIONS pattern we created:
@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String urlString = url.getURL().toLowerCase();
    return !EXCLUSIONS.matcher(urlString).matches()
      && urlString.startsWith("https://www.baeldung.com/");
}
Then, we can do our processing for visited pages in the visit method:
@Override
public void visit(Page page) {
    String url = page.getWebURL().getURL();
    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        String title = htmlParseData.getTitle();
        String text = htmlParseData.getText();
        String html = htmlParseData.getHtml();
        Set<WebURL> links = htmlParseData.getOutgoingUrls();

        // do something with the collected data
    }
}
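For example, a minimal implementation of that last comment might simply print a short summary of each visited page. The lines below are purely illustrative and use only the variables already defined in the method:

// purely illustrative: summarize the parsed page using the variables above
System.out.printf("Visited %s%n", url);
System.out.printf("  title: %s, text length: %d, outgoing links: %d%n",
  title, text.length(), links.size());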
Once we have our crawler written, we need to configure and run it:
File crawlStorage = new File("src/test/resources/crawler4j");
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder(crawlStorage.getAbsolutePath());

int numCrawlers = 12;

PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

controller.addSeed("https://www.baeldung.com/");

CrawlController.WebCrawlerFactory<HtmlCrawler> factory = HtmlCrawler::new;

controller.start(factory, numCrawlers);
We configured a temporary storage directory, specified the number of crawling threads, and seeded the crawler with a starting URL.
We should also note that the CrawlController.start() method is a blocking operation. Any code after that call will only execute after the crawler has finished running.
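For instance, if we want to confirm when the crawl is done, a sketch can be as simple as placing a statement after the call; the log message itself is our own addition:

controller.start(factory, numCrawlers);

// this line only executes once all crawler threads have finished
System.out.println("Crawl of https://www.baeldung.com/ completed");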
3.2. ImageCrawler
By default, crawler4j doesn’t crawl binary data. In this next example, we’ll turn on that functionality and crawl all the JPEGs on Baeldung.
Let’s start by defining the ImageCrawler class with a constructor that takes a directory for saving images:
public class ImageCrawler extends WebCrawler {

    private final static Pattern EXCLUSIONS
      = Pattern.compile(".*(\\.(css|js|xml|gif|png|mp3|mp4|zip|gz|pdf))$");

    private static final Pattern IMG_PATTERNS = Pattern.compile(".*(\\.(jpg|jpeg))$");

    private File saveDir;

    public ImageCrawler(File saveDir) {
        this.saveDir = saveDir;
    }

    // more code
}
Next, let’s implement the shouldVisit method:
@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String urlString = url.getURL().toLowerCase();
    if (EXCLUSIONS.matcher(urlString).matches()) {
        return false;
    }

    if (IMG_PATTERNS.matcher(urlString).matches()
        || urlString.startsWith("https://www.baeldung.com/")) {
        return true;
    }

    return false;
}
Now, we’re ready to implement the visit method:
@Override
public void visit(Page page) {
    String url = page.getWebURL().getURL();
    if (IMG_PATTERNS.matcher(url).matches()
        && page.getParseData() instanceof BinaryParseData) {
        String extension = url.substring(url.lastIndexOf("."));
        int contentLength = page.getContentData().length;

        // write the content data to a file in the save directory
    }
}
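To fill in that last comment, here's one possible sketch for saving the image. The file-naming scheme (URL hash plus the original extension) is our own choice rather than anything crawler4j prescribes, and it assumes java.nio.file.Files and java.io.IOException are imported:

// one way to write the content data to a file in the save directory
String filename = url.hashCode() + extension;
File imageFile = new File(saveDir, filename);
try {
    Files.write(imageFile.toPath(), page.getContentData());
} catch (IOException e) {
    // handle or log the failure as appropriate
    e.printStackTrace();
}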
Running our ImageCrawler is similar to running the HtmlCrawler, but we need to configure it to include binary content:
CrawlConfig config = new CrawlConfig();
config.setIncludeBinaryContentInCrawling(true);
// ... same as before
CrawlController.WebCrawlerFactory<ImageCrawler> factory = () -> new ImageCrawler(saveDir);
controller.start(factory, numCrawlers);
3.3. Collecting Data
Now that we’ve looked at a couple of basic examples, let’s expand on our HtmlCrawler to collect some basic statistics during our crawl.
First, let’s define a simple class to hold a couple of statistics:
public class CrawlerStatistics {
    private int processedPageCount = 0;
    private int totalLinksCount = 0;

    public void incrementProcessedPageCount() {
        processedPageCount++;
    }

    public void incrementTotalLinksCount(int linksCount) {
        totalLinksCount += linksCount;
    }

    // standard getters
}
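One caveat worth noting: a single CrawlerStatistics instance will be shared by all of the crawler threads, so plain int counters can lose updates under concurrency. If exact counts matter, a thread-safe variant is a small change; this is our own refinement rather than part of the original example, and it needs java.util.concurrent.atomic.AtomicInteger:

public class CrawlerStatistics {
    private final AtomicInteger processedPageCount = new AtomicInteger();
    private final AtomicInteger totalLinksCount = new AtomicInteger();

    public void incrementProcessedPageCount() {
        processedPageCount.incrementAndGet();
    }

    public void incrementTotalLinksCount(int linksCount) {
        totalLinksCount.addAndGet(linksCount);
    }

    public int getProcessedPageCount() {
        return processedPageCount.get();
    }

    public int getTotalLinksCount() {
        return totalLinksCount.get();
    }
}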
Next, let’s modify our HtmlCrawler to accept a CrawlerStatistics instance via a constructor:
private CrawlerStatistics stats;

public HtmlCrawler(CrawlerStatistics stats) {
    this.stats = stats;
}
With our new CrawlerStatistics object, let’s modify the visit method to collect what we want:
@Override
public void visit(Page page) {
    String url = page.getWebURL().getURL();
    stats.incrementProcessedPageCount();

    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        String title = htmlParseData.getTitle();
        String text = htmlParseData.getText();
        String html = htmlParseData.getHtml();
        Set<WebURL> links = htmlParseData.getOutgoingUrls();
        stats.incrementTotalLinksCount(links.size());

        // do something with collected data
    }
}
Now, let’s head back to our controller and provide the HtmlCrawler with an instance of CrawlerStatistics:
CrawlerStatistics stats = new CrawlerStatistics();
CrawlController.WebCrawlerFactory<HtmlCrawler> factory = () -> new HtmlCrawler(stats);
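Since start is blocking, we can read the results back through the statistics object once it returns. The getter names below assume the "standard getters" mentioned earlier:

controller.start(factory, numCrawlers);

System.out.printf("Processed pages: %d, total outgoing links: %d%n",
  stats.getProcessedPageCount(), stats.getTotalLinksCount());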
3.4. Multiple Crawlers
Building on our previous examples, let’s now have a look at how we can run multiple crawlers from the same controller.
It’s recommended that each crawler use its own temporary storage directory, so we need to create separate configurations for each one we’ll be running.
The CrawlControllers can share a single RobotstxtServer, but otherwise, we basically need a copy of everything.
So far, we've used the CrawlController.start method to run our crawlers and noted that it's a blocking method. To run multiple crawlers at once, we'll use CrawlController.startNonBlocking in conjunction with CrawlController.waitUntilFinish.
Now, let’s create a controller to run HtmlCrawler and ImageCrawler concurrently:
File crawlStorageBase = new File("src/test/resources/crawler4j");
CrawlConfig htmlConfig = new CrawlConfig();
CrawlConfig imageConfig = new CrawlConfig();

// Configure storage folders and other configurations
// (the subfolder names are arbitrary; each crawler just needs its own directory)
htmlConfig.setCrawlStorageFolder(new File(crawlStorageBase, "html").getAbsolutePath());
imageConfig.setCrawlStorageFolder(new File(crawlStorageBase, "image").getAbsolutePath());
imageConfig.setIncludeBinaryContentInCrawling(true);

PageFetcher pageFetcherHtml = new PageFetcher(htmlConfig);
PageFetcher pageFetcherImage = new PageFetcher(imageConfig);

RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcherHtml);

CrawlController htmlController
  = new CrawlController(htmlConfig, pageFetcherHtml, robotstxtServer);
CrawlController imageController
  = new CrawlController(imageConfig, pageFetcherImage, robotstxtServer);

// add seed URLs
htmlController.addSeed("https://www.baeldung.com/");
imageController.addSeed("https://www.baeldung.com/");

CrawlerStatistics stats = new CrawlerStatistics();
CrawlController.WebCrawlerFactory<HtmlCrawler> htmlFactory = () -> new HtmlCrawler(stats);

File saveDir = new File("src/test/resources/crawler4j");
CrawlController.WebCrawlerFactory<ImageCrawler> imageFactory
  = () -> new ImageCrawler(saveDir);

imageController.startNonBlocking(imageFactory, 7);
htmlController.startNonBlocking(htmlFactory, 10);

htmlController.waitUntilFinish();
imageController.waitUntilFinish();
4. Configuration
We’ve already seen some of what we can configure. Now, let’s go over some other common settings.
Settings are applied to the CrawlConfig instance we specify in our controller.
4.1. Limiting Crawl Depth
By default, our crawlers will crawl as deep as they can. To limit how deep they’ll go, we can set the crawl depth:
crawlConfig.setMaxDepthOfCrawling(2);
Seed URLs are considered to be at depth 0, so a crawl depth of 2 will go two layers beyond the seed URL.
4.2. Maximum Pages to Fetch
Another way to limit how many pages our crawlers will cover is to set the maximum number of pages to crawl:
crawlConfig.setMaxPagesToFetch(500);
4.3. Maximum Outgoing Links
We can also limit the number of outgoing links followed off each page:
crawlConfig.setMaxOutgoingLinksToFollow(2000);
4.4. Politeness Delay
Since very efficient crawlers can easily be a strain on web servers, crawler4j has what it calls a politeness delay. By default, it’s set to 200 milliseconds. We can adjust this value if we need to:
crawlConfig.setPolitenessDelay(300);
4.5. Include Binary Content
We already used the option for including binary content with our ImageCrawler:
crawlConfig.setIncludeBinaryContentInCrawling(true);
4.6. Include HTTPS
By default, crawlers will include HTTPS pages, but we can turn that off:
crawlConfig.setIncludeHttpsPages(false);
4.7. Resumable Crawling
If we have a long-running crawler that we want to be able to resume if it stops, we can enable resumable crawling. Turning it on may cause it to run slower:
crawlConfig.setResumableCrawling(true);
4.8. User-Agent String
The default user-agent string for crawler4j is crawler4j. Let’s customize that:
crawlConfig.setUserAgentString("baeldung demo (https://github.com/yasserg/crawler4j/)");
We've just covered some of the basic configurations here. We can look at the CrawlConfig class if we're interested in some of the more advanced or obscure configuration options.
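To tie this section together, here's a sketch of a single CrawlConfig that combines several of the options we've seen, reusing the same example values as above:

CrawlConfig crawlConfig = new CrawlConfig();
crawlConfig.setCrawlStorageFolder(new File("src/test/resources/crawler4j").getAbsolutePath());
crawlConfig.setMaxDepthOfCrawling(2);
crawlConfig.setMaxPagesToFetch(500);
crawlConfig.setMaxOutgoingLinksToFollow(2000);
crawlConfig.setPolitenessDelay(300);
crawlConfig.setIncludeBinaryContentInCrawling(true);
crawlConfig.setUserAgentString("baeldung demo (https://github.com/yasserg/crawler4j/)");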
5. Conclusion
In this article, we’ve used crawler4j to create our own web crawlers. We started with two simple examples of crawling HTML and images. Then, we built on those examples to see how we can gather statistics and run multiple crawlers concurrently.
The full code examples are available over on GitHub.