Journal of Information Science and Engineering

A Full-Coverage Two-Level URL Duplication Checking Method for a High-Speed Parallel Web Crawler



Abstract

URL duplication checking is a significant bottleneck in large-scale Web crawlers, making efficient checking techniques essential. In this paper, we propose a new URL duplication checking technique for a parallel Web crawler, which we call full-coverage two-level URL duplication checking (full-coverage-2L-UDC). Full-coverage-2L-UDC provides efficient URL duplication checking while ensuring maximum coverage. First, we propose two-level URL duplication checking (2L-UDC), which makes duplication checking efficient by communicating at the Web-site level rather than at the Web-page level. Second, we present a solution for the so-called coverage problem, which directly affects the recall of a search engine; it is the first solution to the coverage problem in the centralized parallel architecture. Third, we propose an architecture, FC2L-UDCbot, for a centralized parallel crawler using full-coverage-2L-UDC. We build a seven-agent FC2L-UDCbot for extensive experiments and show that its crawling speed is approximately proportional to the number of agents (i.e., FC2L-UDCbot is 6.9 times faster than a single-machine crawler). Full-coverage-2L-UDC allows FC2L-UDCbot to scale with the number of agents since it effectively handles the overheads incurred in a parallel environment. Through an in-depth analysis, we construct a cost model for estimating the crawling speed of a scaled-up crawler. Using the model, we show that FC2L-UDCbot could crawl Google-scale Web pages within several days using dozens of agents.
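The site-level (rather than page-level) communication idea from the abstract can be sketched roughly as follows. This is a hypothetical illustration, not the paper's actual algorithm: sites are partitioned across agents by a hash of the hostname (first level), so an agent only forwards URLs whose host belongs to another agent; each agent then deduplicates pages locally in memory (second level). All class and method names here are invented for the sketch.

```python
import hashlib
from urllib.parse import urlsplit


class TwoLevelDedup:
    """Sketch of two-level URL duplication checking (hypothetical).

    First level: site-level partitioning, so inter-agent traffic is
    routed at hostname granularity. Second level: per-agent, per-page
    dedup with a local in-memory set.
    """

    def __init__(self, agent_id: int, num_agents: int):
        self.agent_id = agent_id
        self.num_agents = num_agents
        self.seen = set()      # second level: pages this agent has scheduled
        self.outbox = []       # URLs routed to other agents

    def owner(self, url: str) -> int:
        # First level: assign each site (hostname) to one agent by hashing.
        host = urlsplit(url).netloc.lower()
        digest = hashlib.sha1(host.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % self.num_agents

    def submit(self, url: str) -> bool:
        """Return True if this agent should crawl the URL (first sighting)."""
        if self.owner(url) != self.agent_id:
            # Not our site: forward to the owning agent; no page-level
            # duplication check is needed here, only site-level routing.
            self.outbox.append(url)
            return False
        if url in self.seen:
            return False       # duplicate, already scheduled locally
        self.seen.add(url)
        return True
```

Because every page of a given site hashes to the same agent, page-level duplicates never need to be checked across agents, which is what keeps inter-agent communication at the site level.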
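The abstract's reported 6.9x speedup with seven agents implies near-linear scaling. A back-of-envelope estimate in that spirit (not the paper's actual cost model; the efficiency factor and parameter names below are assumptions) might look like this:

```python
def crawl_time_days(total_pages: float,
                    pages_per_sec_per_agent: float,
                    num_agents: int,
                    efficiency: float = 6.9 / 7) -> float:
    """Rough crawl-time estimate under linear scaling (hypothetical model).

    Assumes aggregate throughput is num_agents * per-agent throughput,
    discounted by a parallel-efficiency factor; the default 6.9/7
    (about 0.986) matches the seven-agent speedup quoted above.
    """
    throughput = num_agents * pages_per_sec_per_agent * efficiency
    seconds = total_pages / throughput
    return seconds / 86400.0  # seconds per day
```

Under such a model, crawl time falls almost inversely with the number of agents, which is consistent with the abstract's claim that dozens of agents could cover a Google-scale page set within days.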
