Journal: IEICE Transactions on Information and Systems

Exploring Web Partition in DHT-Based Distributed Web Crawling


Abstract

The basic requirements of a distributed Web crawling system are short download time, low communication overhead, and balanced load, all of which depend largely on the system's Web partition strategy. In this paper, we propose a DHT-based distributed Web crawling system and several DHT-based Web partition methods. First, we propose a new system model based on a DHT called the Content Addressable Network (CAN). Second, based on this model, we implement a network-distance-based Web partition that reduces the crawler-crawlee network distance in a fully distributed manner. Third, by exploiting locality in the link space, we propose the concept of link-based Web partition to reduce the communication overhead of the system. This method reduces not only the number of inter-links to be exchanged among the crawlers but also the cost of routing on the DHT overlay. To combine the benefits of these two Web partition methods, we then propose two distributed multi-objective Web partition methods. Finally, all the methods proposed in this paper are compared with existing system models in simulated experiments on different datasets and at different system scales. In most cases, the new methods outperform the existing models.
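
To make the partition idea concrete, the following is a minimal Python sketch of the plain hash-based baseline only: each host name is hashed to a point in a two-dimensional CAN-style coordinate space, and the crawler whose zone contains that point becomes responsible for the host. It is not the network-distance-based or link-based method from the paper. The fixed `side` x `side` grid of equal zones and the function names `host_to_cell`, `crawler_for`, and `count_inter_links` are assumptions made for illustration; a real CAN splits zones dynamically as nodes join.

```python
import hashlib
from urllib.parse import urlsplit


def host_to_cell(host: str, side: int) -> tuple[int, int]:
    """Hash a host name to a (column, row) cell of a side x side grid.

    The first 16 bytes of the SHA-1 digest give two 64-bit coordinates.
    Scaling with integer arithmetic keeps every index inside [0, side).
    """
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    hx = int.from_bytes(digest[:8], "big")
    hy = int.from_bytes(digest[8:16], "big")
    return (hx * side) >> 64, (hy * side) >> 64


def crawler_for(url: str, side: int) -> int:
    """Return the id of the crawler whose (assumed) grid zone owns the URL's host."""
    host = urlsplit(url).hostname or ""  # hostname is already lower-cased
    col, row = host_to_cell(host, side)
    return row * side + col


def count_inter_links(links: list[tuple[str, str]], side: int) -> int:
    """Count links whose source and target belong to different crawlers.

    Each such link has to be forwarded to another crawler, so this count is
    a rough proxy for the communication overhead of a partition.
    """
    return sum(
        1 for src, dst in links
        if crawler_for(src, side) != crawler_for(dst, side)
    )


if __name__ == "__main__":
    sample_links = [
        ("http://example.org/a", "http://example.org/b"),
        ("http://example.org/a", "http://example.net/"),
    ]
    print(crawler_for("http://example.org/a", 4))
    print(count_inter_links(sample_links, 4))
```

Because the hash ignores both network location and link structure, pages that link to each other often land on different crawlers. That inter-link traffic, together with the crawler-crawlee distance, is what the partition methods in the abstract aim to reduce.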
