A High Efficient Incremental Microblog Crawler:Design and Implementation


Abstract

With the rapid development of microblog technology, many interesting research issues in microblogging have attracted growing attention. Fetching data from microblogs is the groundwork for this research. In this paper we propose a flexible multithreaded microblog crawling architecture based on the classic multi-producer, multi-consumer model, and implement a highly efficient incremental microblog crawler for Sina Microblog (also called Weibo). The crawler handles vertical crawling, dynamic webpages, and automatic login, which a general-purpose crawler cannot handle. It also achieves high-precision extraction of structured web data. Several measures are designed to evaluate crawling performance. Experimental results demonstrate that the crawler achieves over 95% coverage and good freshness.
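The multi-producer, multi-consumer model named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function `run_crawl`, its parameters, and the stubbed `fetch` callable are hypothetical names chosen for illustration, and the sketch does not attempt to reproduce the crawler's login handling, incremental scheduling, or page extraction. It only shows producer threads placing work items on a shared bounded queue while consumer threads take them off and process them.

```python
import queue
import threading


def run_crawl(seed_ids, fetch, n_producers=2, n_consumers=3):
    """Fetch each distinct id in seed_ids once; return {id: fetch(id)}.

    `fetch` is a caller-supplied function (hypothetical here) that
    downloads and returns the page for one id.
    """
    tasks = queue.Queue(maxsize=100)  # bounded buffer shared by all threads
    results = {}
    results_lock = threading.Lock()
    seen = set()                      # ids already queued, to skip duplicates
    seen_lock = threading.Lock()

    def producer(ids):
        for uid in ids:
            with seen_lock:
                if uid in seen:
                    continue
                seen.add(uid)
            tasks.put(uid)            # blocks while the buffer is full

    def consumer():
        while True:
            uid = tasks.get()         # blocks while the buffer is empty
            if uid is None:           # sentinel: no more work
                return
            page = fetch(uid)
            with results_lock:
                results[uid] = page

    consumers = [threading.Thread(target=consumer) for _ in range(n_consumers)]
    for t in consumers:
        t.start()

    # Split the seed list round-robin across the producer threads.
    chunks = [seed_ids[i::n_producers] for i in range(n_producers)]
    producers = [threading.Thread(target=producer, args=(c,)) for c in chunks]
    for t in producers:
        t.start()
    for t in producers:
        t.join()

    # One sentinel per consumer, queued after all real work.
    for _ in consumers:
        tasks.put(None)
    for t in consumers:
        t.join()
    return results
```

A call such as `run_crawl(user_ids, fetch_page)` would return a dictionary of fetched pages keyed by id, where `user_ids` and `fetch_page` are supplied by the caller. The bounded queue is a deliberate choice in this sketch: it keeps fast producers from running arbitrarily far ahead of slower, network-bound consumers.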

Bibliographic details

  • Source
    Journal of information and computational science | 2013, Issue 6 | pp. 1731-1747 | 17 pages
  • Author affiliations

    The Research Center of Computational Experiments and Parallel System Technology, College of Information Systems and Management, National University of Defense Technology, Changsha 410073, China (listed for each of the four author entries)

  • Original format: PDF
  • Language of text: English
  • Keywords

    Sina Microblog; Incremental Crawling; Webpage Extraction;
