Optimizing Crawler4j using MapReduce Programming Model

Abstract

The World Wide Web is a decentralized system consisting of a repository of information in the form of web pages. These web pages serve as a source of data in today's analytics world. Web crawlers are used to extract useful information from web pages for several purposes. First, they are used in web search engines, where web pages are indexed to form a corpus of information that users can query. Second, they are used for web archiving, where web pages are stored for later analysis. Third, they can be used for web mining, where web pages are monitored for copyright purposes. The amount of information a web crawler can process needs to be increased by exploiting modern parallel processing technologies. To address the problems of parallelism and crawling throughput, this work proposes to optimize Crawler4j using the Hadoop MapReduce programming model by parallelizing the processing of large input data. Crawler4j is a web crawler that retrieves useful information about the pages it visits. Coupling Crawler4j with the data and computational parallelism of the Hadoop MapReduce programming model improves the throughput and accuracy of web crawling. The experimental results demonstrate that the proposed solution achieves significant improvements in performance and throughput. The proposed approach thus carves out a new methodology for optimizing web crawling by delivering a significant performance gain.
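The abstract describes the design only at a high level; the paper's own code is not reproduced here. As a rough illustration of the idea rather than the authors' implementation, the minimal sketch below runs an independent Crawler4j instance inside each Hadoop map task, taking a text file of seed URLs as job input. It assumes the crawler4j 4.x and Hadoop MapReduce (org.apache.hadoop.mapreduce) APIs; the class names DistributedCrawlJob, CrawlMapper, and TitleCrawler, the map-only job layout, and all parameters (crawl depth, thread count, storage paths) are illustrative assumptions.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class DistributedCrawlJob {

    /** Minimal Crawler4j visitor: logs the title of every page it fetches. */
    public static class TitleCrawler extends WebCrawler {
        @Override
        public void visit(Page page) {
            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData html = (HtmlParseData) page.getParseData();
                logger.info("{} -> {}", page.getWebURL().getURL(), html.getTitle());
            }
        }
    }

    /**
     * Each input line is one seed URL; every map task runs an independent
     * Crawler4j instance over the seeds in its input split, so Hadoop adds
     * inter-task parallelism on top of Crawler4j's own crawler threads.
     */
    public static class CrawlMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String seed = value.toString().trim();
            if (seed.isEmpty()) {
                return;
            }
            try {
                CrawlConfig config = new CrawlConfig();
                // Per-task scratch folder so concurrent tasks do not collide.
                config.setCrawlStorageFolder("/tmp/crawl-" + context.getTaskAttemptID());
                config.setMaxDepthOfCrawling(2); // keep per-task crawls bounded
                PageFetcher fetcher = new PageFetcher(config);
                RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
                CrawlController controller = new CrawlController(config, fetcher, robots);
                controller.addSeed(seed);
                controller.start(TitleCrawler.class, 4); // 4 threads; blocks until done
                context.write(new Text(seed), new Text("crawled"));
            } catch (Exception e) {
                context.write(new Text(seed), new Text("failed: " + e.getMessage()));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "crawler4j-mapreduce");
        job.setJarByClass(DistributedCrawlJob.class);
        job.setMapperClass(CrawlMapper.class);
        job.setNumReduceTasks(0); // map-only: each task's crawl output is the result
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));  // text file of seed URLs
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

In this hypothetical layout, Hadoop supplies inter-machine parallelism by scheduling one map task per input split of seed URLs, while each task retains Crawler4j's own multi-threaded fetching. Since controller.start() blocks until the crawl finishes, long crawls inside a map task may require raising mapreduce.task.timeout or reporting progress periodically.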

Bibliographic details

  • Source: Journal of The Institution of Engineers (India): Series B
  • Author affiliations

    Department of Information Science and Engineering, M S Ramaiah Institute of Technology, Bangalore, India;

    Department of Information Science and Engineering, M S Ramaiah Institute of Technology, Bangalore, India;

    Department of Information Science and Engineering, M S Ramaiah Institute of Technology, Bangalore, India;

    Department of Information Science and Engineering, M S Ramaiah Institute of Technology, Bangalore, India;

    Department of Information Science and Engineering, M S Ramaiah Institute of Technology, Bangalore, India;

    Department of Computer Science and Engineering, M S Ramaiah Institute of Technology, Bangalore, India;

  • Indexing information
  • Original format: PDF
  • Language: English
  • CLC classification
  • Keywords

Web crawler; Crawler4j; Hadoop; MapReduce; Crawler4j with Hadoop;
