Conference: Asia-Pacific Network Operations and Management Symposium

Implementation of a distributed web community crawler

Abstract

A web community is an important space where online users exchange information, ideas, and thoughts. Because of the collective intelligence of web communities, marketing and advertising activities have focused heavily on these sites. Although articles in web communities are open to the public, they cannot be easily collected and analyzed, because they are written in natural language and their formats are diverse. Many web crawlers are available, but they are not well suited to gathering these documents. First, the URLs of web articles change frequently and are redundant, which makes crawling difficult. Second, the number of articles is so large that the crawler must be designed to scale. We therefore propose a distributed web crawler optimized for collecting articles from popular communities. Our experiments show that our implementation achieves high throughput compared with the open-source crawler Nutch.
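
The abstract does not describe how the crawler handles redundant article URLs, so the following is only a minimal sketch of one common approach, not the authors' method: canonicalize each URL before checking it against a set of already-seen URLs. The names `canonicalize`, `Frontier`, and the `IGNORED_PARAMS` list are hypothetical and chosen for illustration; a real community site would need its own list of parameters that do not identify an article.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical query parameters assumed not to identify an article.
IGNORED_PARAMS = {"page", "sort", "sessionid", "ref"}


def canonicalize(url: str) -> str:
    """Map URL variants that point at the same article to one key."""
    parts = urlsplit(url)
    # Drop ignored parameters and sort the rest so order does not matter.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    # Treat "/board/view/" and "/board/view" as the same path.
    path = parts.path.rstrip("/") or "/"
    # Lowercase scheme and host, and discard the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


class Frontier:
    """URL queue that skips anything whose canonical form was already seen."""

    def __init__(self) -> None:
        self._seen: set[str] = set()
        self.queue: list[str] = []

    def add(self, url: str) -> bool:
        key = canonicalize(url)
        if key in self._seen:
            return False  # redundant URL, not queued again
        self._seen.add(key)
        self.queue.append(key)
        return True


if __name__ == "__main__":
    frontier = Frontier()
    print(frontier.add("HTTP://Example.com/board/view/?id=42&page=3#top"))
    print(frontier.add("http://example.com/board/view?sort=new&id=42"))
    print(frontier.queue)
```

In a distributed setting, the seen-set would have to be shared or partitioned across workers; this sketch keeps it in memory on a single process for brevity.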
