Conference: Algorithms and Architectures for Parallel Processing

The Research and Implementation of a Distributed Crawler System Based on Apache Flink

Abstract

Web information is growing at an explosive rate. The crawling capacity of a single-machine crawler has become a bottleneck, so distributed web crawling techniques have become a focus of research. However, existing distributed web crawler systems have some shortcomings. Thread management, which addresses thread synchronization and resource competition, is usually implemented with purely multi-threaded asynchronous methods, but executing this mechanism noticeably reduces performance. Moreover, the deduplication algorithms are either inefficient on large data sets or occupy a large amount of storage space. We therefore propose and implement a distributed web crawler system based on Apache Flink that integrates the Mesos/Marathon framework. It can make full use of the cluster's computing resources and significantly improve the efficiency of the web crawler system. Taking NetEase news pages as an example, the experimental results show that the proposed distributed crawler achieves higher execution efficiency and reliability.