
Efficient Multi-threaded Crawling Using In Memory Data Structures



Abstract

Crawling the internet is an essential task for any search engine. A crawler is a software program that sends HTTP requests to web servers across the internet and downloads their content. As the internet has grown explosively over the last decade, efficient parallel crawlers have become a necessity. One factor that degrades crawler performance is the disk access incurred every time a file is written. Because crawling the web requires downloading tens or hundreds of millions of web pages, much time is spent on disk writes because of seek times. This work presents an efficient multi-threaded crawler that incorporates an in-memory data structure to reduce overall disk write time. The results show that, at selected sizes of the in-memory data structure, the proposed technique can increase throughput by about 50% over a conventional multi-threaded crawler with no in-memory data structure. In addition, the results show that this design can achieve an average crawl speed of 22 pages/sec, which exceeds previously reported work.
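
The abstract does not describe the in-memory data structure itself, so the following is only a minimal sketch of the general idea, assuming a simple size-bounded buffer: worker threads append downloaded pages to memory, and the buffer is written to disk in one sequential append once it passes a threshold, instead of one write per page. The names `PageBuffer`, `crawl`, `max_bytes`, and the injected `fetch` callable are hypothetical and are not taken from the paper.

```python
import queue
import threading
from typing import Callable, Iterable


class PageBuffer:
    """Thread-safe in-memory buffer that batches page writes to disk.

    Hypothetical illustration; not the data structure used in the paper.
    """

    def __init__(self, out_path: str, max_bytes: int) -> None:
        self.out_path = out_path
        self.max_bytes = max_bytes  # flush threshold (assumed tuning knob)
        self._lock = threading.Lock()
        self._records = []          # pending records, held in memory
        self._size = 0              # total bytes currently buffered
        self.flush_count = 0        # number of disk writes performed

    def add(self, url: str, body: bytes) -> None:
        """Buffer one page; flush to disk if the threshold is reached."""
        record = f"{url}\t{len(body)}\n".encode("utf-8") + body + b"\n"
        with self._lock:
            self._records.append(record)
            self._size += len(record)
            if self._size >= self.max_bytes:
                self._flush_locked()

    def flush(self) -> None:
        """Write any remaining buffered pages to disk."""
        with self._lock:
            self._flush_locked()

    def _flush_locked(self) -> None:
        # Caller must hold self._lock.
        if not self._records:
            return
        # One sequential append for the whole batch.
        with open(self.out_path, "ab") as f:
            f.write(b"".join(self._records))
        self._records.clear()
        self._size = 0
        self.flush_count += 1


def crawl(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    buffer: PageBuffer,
    num_threads: int = 4,
) -> None:
    """Download every URL with a pool of threads, storing pages via buffer."""
    work = queue.Queue()
    for url in urls:
        work.put(url)

    def worker() -> None:
        while True:
            try:
                url = work.get_nowait()
            except queue.Empty:
                return  # no work left
            try:
                buffer.add(url, fetch(url))
            except Exception:
                pass  # sketch only: skip pages that fail to download

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    buffer.flush()  # persist whatever is still in memory
```

In this sketch, `fetch` could be any function that takes a URL and returns the page body as bytes, for example a thin wrapper around `urllib.request.urlopen`. A larger `max_bytes` means fewer disk writes but more memory held and more data lost if the process dies before a flush, which is consistent with the abstract reporting gains only at selected buffer sizes.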
