Optimizing Apache Nutch for domain specific crawling at large scale

Abstract

Focused crawls are key to acquiring data at large scale in order to implement systems such as domain search engines and knowledge databases. They add nontrivial problems to the already difficult problem of web-scale crawling. To address some of these issues, BCube, a building block of the National Science Foundation's EarthCube program, has developed a tailored version of Apache Nutch for data and web services discovery at scale. We describe how we started with a vanilla version of Apache Nutch and how we optimized and scaled it to reach gigabytes of discovered links and almost half a billion documents of interest crawled so far.

Bibliographic record

Source
IEEE International Congress on Big Data | 2015 | pp. 1967-1971 | 5 pages
Authors
Lopez Luis A.; Duerr Ruth; Khalsa Siri Jodha Singh;
Original format: PDF
Keywords
Apache Nutch; big data; data discovery; focused crawl;

Similar literature

1. Unsupervised domain ranking in large-scale web crawls [J]. Mercedes Martinez Gonzalez. Computing Reviews. 2019, no. 7
2. Unsupervised Domain Ranking in Large-Scale Web Crawls [J]. Cui Yi, Sparkman Clint, Lee Hsin-Tsang, ACM Transactions on the Web. 2018, no. 4
3. Optimizing Apache Nutch For Domain Specific Crawling at Large Scale [C]. Luis A. Lopez, Ruth Duerr, Siri Jodha Singh Khalsa. IEEE International Conference on Big Data. 2015
4. A novel hybrid focused crawling algorithm to build domain-specific collections [D]. Chen, Yuxin. 2007
5. Mathematical model for empirically optimizing large scale production of soluble protein domains [O]. Eisuke Chikayama, Atsushi Kurotani, Takanori Tanaka, 2010
6. An Extended Model for Effective Migrating Parallel Web Crawling with Domain Specific and Incremental Crawling [O]. Md. Faizan Farooqui. 2012
