IEEE International Congress on Big Data

Optimizing Apache Nutch for Domain-Specific Crawling at Large Scale



Abstract

Focused crawls are key to acquiring data at large scale in order to implement systems like domain search engines and knowledge databases. Focused crawls add non-trivial problems to the already difficult problem of web-scale crawling. To address some of these issues, BCube - a building block of the National Science Foundation's EarthCube program - has developed a tailored version of Apache Nutch for data and web-service discovery at scale. We describe how we started with a vanilla version of Apache Nutch and how we optimized and scaled it to reach gigabytes of discovered links and almost half a billion documents of interest crawled so far.
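For context on the kind of tailoring the abstract refers to: in vanilla Apache Nutch, the scope of a focused crawl is typically constrained through the `urlfilter-regex` plugin, whose rules live in `conf/regex-urlfilter.txt`. The sketch below is illustrative only and is not the BCube project's actual configuration; the domain names are hypothetical.

```
# conf/regex-urlfilter.txt (illustrative sketch, not BCube's real rules).
# Each line is "+" (accept) or "-" (reject) followed by a Java regex
# matched against candidate URLs; the first matching rule wins.

# Skip URLs with common non-document file extensions.
-\.(gif|jpg|png|css|js|zip|gz)$

# Restrict the crawl to hypothetical geoscience data portals.
+^https?://([a-z0-9-]+\.)*example-datacenter\.org/
+^https?://([a-z0-9-]+\.)*example-geodata\.edu/

# Reject everything else.
-.
```

Rule ordering matters here: because the first match wins, the blanket reject (`-.`) must come last, or it would discard every URL before the accept rules are consulted.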


