International Conference on Future Computer and Communication (ICFCC)
A Distributed Vertical Crawler Using Crawling-Period Based Strategy

Abstract

Because of the explosive growth in the number of web pages, centralized crawlers can no longer crawl the web efficiently. Many distributed crawlers are in wide use; however, none of them is suitable for template-customized vertical crawling. In this paper, we present a distributed template-customized vertical crawler designed specifically for crawling Internet forums. We describe in detail the client-server architecture of the system and the function of each module; the design can be easily extended to other fields. We also propose a crawling-period based distribution strategy, with which the crawler manager can balance the quantity of crawling tasks against the resources of each crawler, and the crawler can flexibly process websites with different update frequencies. We further define a communication protocol between the crawlers and the crawler manager and describe how to solve the duplicated-crawling problem in the distributed system. The experiment compares the performance of a centralized vertical crawler with that of the distributed vertical crawler. The experimental results demonstrate that running all crawlers in the distributed system in parallel can greatly enhance crawling efficiency.
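The abstract does not spell out how the crawling-period based strategy assigns work, so the following is only an illustrative sketch of one way a crawler manager might do it, not the paper's algorithm. It assumes each task has a page count and a crawling period, treats pages divided by period as the task's demand, and greedily gives each task to the crawler whose relative load would stay lowest. The names `Crawler` and `assign_tasks`, the pages-per-hour capacity unit, and the URL-set duplicate check are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Crawler:
    name: str
    capacity: float              # assumed unit: pages per hour the crawler can fetch
    load: float = 0.0            # pages per hour already assigned
    tasks: list = field(default_factory=list)


def assign_tasks(tasks, crawlers):
    """Assign (url, pages, period_hours) tasks to crawlers.

    Hypothetical greedy rule: handle the most demanding task first and
    give it to the crawler with the lowest resulting load/capacity ratio.
    A repeated URL is skipped so that no forum is crawled twice.
    """
    seen = set()
    by_demand = sorted(tasks, key=lambda t: t[1] / t[2], reverse=True)
    for url, pages, period_hours in by_demand:
        if url in seen:
            continue             # duplicate task: already assigned
        seen.add(url)
        rate = pages / period_hours
        target = min(crawlers, key=lambda c: (c.load + rate) / c.capacity)
        target.load += rate
        target.tasks.append(url)
    return {c.name: c.tasks for c in crawlers}


if __name__ == "__main__":
    demo_tasks = [
        ("forum-a", 1200, 1),    # busy forum, recrawled every hour
        ("forum-b", 600, 6),     # recrawled every 6 hours
        ("forum-c", 2400, 24),   # recrawled once a day
        ("forum-a", 1200, 1),    # duplicate submission
    ]
    demo_crawlers = [Crawler("c1", 1000), Crawler("c2", 500)]
    print(assign_tasks(demo_tasks, demo_crawlers))
```

A short crawling period raises a task's demand, so frequently updated forums consume more of a crawler's capacity than rarely updated ones of the same size; this is one plausible reading of how the period could feed into load balancing.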