IEEE International Conference on Computer and Communications

Research on Scrapy-Based Distributed Crawler System for Crawling Semi-structure Information at High Speed

Abstract

Video websites present two problems for data collection: the semi-structured information on their pages is complex and underused, and a single-machine crawler collects it inefficiently. To address both, this paper proposes a Scrapy-based distributed crawler system for crawling semi-structured information at high speed. The system extends the traditional single-machine crawler into a distributed one: the Scrapy-Redis distributed component and a Redis database are introduced into the Scrapy framework, and a strategy for crawling semi-structured information and storing it in a standardized form is defined. The system was verified by crawling TV drama information from the video sites Youku, SOHU, Tencent, and iQIYI. In these experiments, the crawling speed of the distributed crawler was 84.53%, 88.95%, 93.05%, and 100% higher, respectively, than that of the single-machine crawler.
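The abstract does not give the system's configuration. As a rough illustration of what introducing Scrapy-Redis and Redis into a Scrapy project usually involves, the settings sketch below routes request scheduling and duplicate filtering through a shared Redis instance. The setting names come from the scrapy-redis project; the Redis address and the pipeline priority are assumptions chosen for illustration, not values from the paper.

```python
# settings.py sketch for a Scrapy project extended with scrapy-redis.
# Assumption: every crawler node can reach one shared Redis server.
# The URL below is a placeholder, not the paper's deployment.

# Keep the request queue in Redis so all nodes pull from the same queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Share request fingerprints through Redis so a URL crawled by one node
# is not crawled again by another.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and fingerprints in Redis when a spider closes,
# so a crawl can be paused and resumed.
SCHEDULER_PERSIST = True

# Push scraped items into Redis for later processing and storage.
# The priority 300 is an arbitrary illustrative value.
ITEM_PIPELINES = {"scrapy_redis.pipelines.RedisPipeline": 300}

# Hypothetical address of the shared Redis instance.
REDIS_URL = "redis://localhost:6379"
```

With settings of this kind, each machine runs the same spider and the nodes coordinate only through Redis, which is one common way to obtain the distributed extension the abstract describes.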
