Optimizing Crawler4j using MapReduce Programming Model

Abstract

The World Wide Web is a decentralized system consisting of a repository of information in the form of web pages. These web pages serve as a source of data in today's analytics world. Web crawlers are used to extract useful information from web pages for several purposes. First, they are used in web search engines, where web pages are indexed to form a corpus of information that users can query. Second, they are used for web archiving, where web pages are stored for later analysis. Third, they can be used for web mining, where web pages are monitored for copyright purposes. The amount of information a web crawler can process needs to be increased by exploiting modern parallel processing technologies. To address the problems of parallelism and crawling throughput, this work proposes to optimize Crawler4j using the Hadoop MapReduce programming model by parallelizing the processing of large input data. Crawler4j is a web crawler that retrieves useful information about the pages it visits. Coupling Crawler4j with the data and computational parallelism of the Hadoop MapReduce programming model improves the throughput and accuracy of web crawling. The experimental results demonstrate that the proposed solution achieves significant improvements in performance and throughput. The proposed approach thus carves out a new methodology for optimizing web crawling by delivering a significant performance gain.
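The abstract describes the design only at a high level; the paper's own code is not reproduced here. As a rough illustration of the idea rather than the authors' implementation, the minimal sketch below runs an independent Crawler4j instance inside each Hadoop map task, taking a text file of seed URLs as job input. It assumes the crawler4j 4.x and Hadoop MapReduce (org.apache.hadoop.mapreduce) APIs; the class names DistributedCrawlJob, CrawlMapper, and TitleCrawler, the map-only job layout, and all parameters (crawl depth, thread count, storage paths) are illustrative assumptions.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class DistributedCrawlJob {

    /** Minimal Crawler4j visitor: logs the title of every page it fetches. */
    public static class TitleCrawler extends WebCrawler {
        @Override
        public void visit(Page page) {
            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData html = (HtmlParseData) page.getParseData();
                logger.info("{} -> {}", page.getWebURL().getURL(), html.getTitle());
            }
        }
    }

    /**
     * Each input line is one seed URL; every map task runs an independent
     * Crawler4j instance over the seeds in its input split, so Hadoop adds
     * inter-task parallelism on top of Crawler4j's own crawler threads.
     */
    public static class CrawlMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String seed = value.toString().trim();
            if (seed.isEmpty()) {
                return;
            }
            try {
                CrawlConfig config = new CrawlConfig();
                // Per-task scratch folder so concurrent tasks do not collide.
                config.setCrawlStorageFolder("/tmp/crawl-" + context.getTaskAttemptID());
                config.setMaxDepthOfCrawling(2); // keep per-task crawls bounded
                PageFetcher fetcher = new PageFetcher(config);
                RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
                CrawlController controller = new CrawlController(config, fetcher, robots);
                controller.addSeed(seed);
                controller.start(TitleCrawler.class, 4); // 4 threads; blocks until done
                context.write(new Text(seed), new Text("crawled"));
            } catch (Exception e) {
                context.write(new Text(seed), new Text("failed: " + e.getMessage()));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "crawler4j-mapreduce");
        job.setJarByClass(DistributedCrawlJob.class);
        job.setMapperClass(CrawlMapper.class);
        job.setNumReduceTasks(0); // map-only: each task's crawl output is the result
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));  // text file of seed URLs
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

In this hypothetical layout, Hadoop supplies inter-machine parallelism by scheduling one map task per input split of seed URLs, while each task retains Crawler4j's own multi-threaded fetching. Since controller.start() blocks until the crawl finishes, long crawls inside a map task may require raising mapreduce.task.timeout or reporting progress periodically.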

Bibliographic details

  • Source: Journal of The Institution of Engineers (India): Series B
  • Author affiliations

    Department of Information Science and Engineering, M S Ramaiah Institute of Technology, Bangalore, India;

    Department of Information Science and Engineering, M S Ramaiah Institute of Technology, Bangalore, India;

    Department of Information Science and Engineering, M S Ramaiah Institute of Technology, Bangalore, India;

    Department of Information Science and Engineering, M S Ramaiah Institute of Technology, Bangalore, India;

    Department of Information Science and Engineering, M S Ramaiah Institute of Technology, Bangalore, India;

    Department of Computer Science and Engineering, M S Ramaiah Institute of Technology, Bangalore, India;

  • Indexing information
  • Original format: PDF
  • Language: English
  • CLC classification
  • Keywords

Web crawler; Crawler4j; Hadoop; MapReduce; Crawler4j with Hadoop;
