Journal of Information Science and Engineering

A Full-Coverage Two-Level URL Duplication Checking Method for a High-Speed Parallel Web Crawler



Abstract

URL duplication checking is a significant bottleneck in large-scale Web crawlers, making efficient checking techniques essential. In this paper, we propose a new URL duplication checking technique for a parallel Web crawler, which we call full-coverage two-level URL duplication checking (full-coverage-2L-UDC). Full-coverage-2L-UDC provides efficient URL duplication checking while ensuring maximum coverage. First, we propose two-level URL duplication checking (2L-UDC), which makes duplication checking efficient by communicating at the Web-site level rather than at the Web-page level. Second, we present a solution for the so-called coverage problem, which directly affects the recall of a search engine; it is the first solution to the coverage problem in the centralized parallel architecture. Third, we propose an architecture, FC2L-UDCbot, for a centralized parallel crawler using full-coverage-2L-UDC. We build a seven-agent FC2L-UDCbot for extensive experiments and show that its crawling speed is approximately proportional to the number of agents (i.e., FC2L-UDCbot is 6.9 times faster than a single-machine crawler). Full-coverage-2L-UDC allows FC2L-UDCbot to scale with the number of agents since it effectively handles the overheads incurred in a parallel environment. Through an in-depth analysis, we construct a cost model for estimating the crawling speed of a scaled-up crawler. Using the model, we show that FC2L-UDCbot could crawl Google-scale Web pages within several days using dozens of agents.
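The site-level (rather than page-level) communication idea from the abstract can be sketched roughly as follows. This is a hypothetical illustration, not the paper's actual algorithm: sites are partitioned across agents by a hash of the hostname (first level), so an agent only forwards URLs whose host belongs to another agent; each agent then deduplicates pages locally in memory (second level). All class and method names here are invented for the sketch.

```python
import hashlib
from urllib.parse import urlsplit


class TwoLevelDedup:
    """Sketch of two-level URL duplication checking (hypothetical).

    First level: site-level partitioning, so inter-agent traffic is
    routed at hostname granularity. Second level: per-agent, per-page
    dedup with a local in-memory set.
    """

    def __init__(self, agent_id: int, num_agents: int):
        self.agent_id = agent_id
        self.num_agents = num_agents
        self.seen = set()      # second level: pages this agent has scheduled
        self.outbox = []       # URLs routed to other agents

    def owner(self, url: str) -> int:
        # First level: assign each site (hostname) to one agent by hashing.
        host = urlsplit(url).netloc.lower()
        digest = hashlib.sha1(host.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % self.num_agents

    def submit(self, url: str) -> bool:
        """Return True if this agent should crawl the URL (first sighting)."""
        if self.owner(url) != self.agent_id:
            # Not our site: forward to the owning agent; no page-level
            # duplication check is needed here, only site-level routing.
            self.outbox.append(url)
            return False
        if url in self.seen:
            return False       # duplicate, already scheduled locally
        self.seen.add(url)
        return True
```

Because every page of a given site hashes to the same agent, page-level duplicates never need to be checked across agents, which is what keeps inter-agent communication at the site level.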
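The abstract's reported 6.9x speedup with seven agents implies near-linear scaling. A back-of-envelope estimate in that spirit (not the paper's actual cost model; the efficiency factor and parameter names below are assumptions) might look like this:

```python
def crawl_time_days(total_pages: float,
                    pages_per_sec_per_agent: float,
                    num_agents: int,
                    efficiency: float = 6.9 / 7) -> float:
    """Rough crawl-time estimate under linear scaling (hypothetical model).

    Assumes aggregate throughput is num_agents * per-agent throughput,
    discounted by a parallel-efficiency factor; the default 6.9/7
    (about 0.986) matches the seven-agent speedup quoted above.
    """
    throughput = num_agents * pages_per_sec_per_agent * efficiency
    seconds = total_pages / throughput
    return seconds / 86400.0  # seconds per day
```

Under such a model, crawl time falls almost inversely with the number of agents, which is consistent with the abstract's claim that dozens of agents could cover a Google-scale page set within days.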
