Conference: International Conference on Behavioral, Economic, Socio-Cultural Computing

DGWC: Distributed and generic web crawler for online information extraction

Abstract

Online information has become an important data source for analyzing public opinion and behavior, which matters for social management and business decisions. Web crawler systems aim to automatically download and parse web pages to extract the desired online information. However, with the rapid growth in the number of web pages and the heterogeneity of page structures, performance and parsing rules have become two serious challenges for web crawler systems. In this paper, we propose a distributed and generic web crawler system (DGWC), in which spiders are scheduled to access and parse web pages in parallel to improve performance, using a shared, memory-based database. Furthermore, we package the spider program and its dependencies in a Docker container so that the system scales horizontally with ease. Last but not least, we propose a statistics-based approach that extracts the main text using a supervised-learning classifier instead of parsing page structures. Experimental results on real-world data validate the efficiency and effectiveness of DGWC.
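
The scheduling idea described in the abstract can be illustrated with a minimal sketch. This is not DGWC's implementation: a thread-safe in-process queue stands in for the shared memory-based database, threads stand in for containerized spiders, and the names `crawl`, `fetch`, and `num_spiders` are hypothetical, introduced here only for illustration.

```python
import queue
import threading


def crawl(seed_urls, fetch, num_spiders=4):
    """Run several spider threads over one shared in-memory URL frontier.

    `fetch` is a hypothetical callable: fetch(url) -> (text, outlinks).
    Returns a dict mapping each visited URL to its text (None on failure).
    """
    frontier = queue.Queue()   # assumption: stand-in for the shared in-memory database
    seen = set()               # URLs that have already been scheduled
    results = {}
    lock = threading.Lock()    # guards `seen` and `results`

    def schedule(url):
        # Enqueue a URL only once, no matter which spider discovers it.
        with lock:
            if url in seen:
                return
            seen.add(url)
        frontier.put(url)

    def spider():
        while True:
            url = frontier.get()
            if url is None:    # sentinel: no more work
                frontier.task_done()
                return
            try:
                text, outlinks = fetch(url)
                with lock:
                    results[url] = text
                for link in outlinks:
                    schedule(link)
            except Exception:
                with lock:
                    results[url] = None
            finally:
                frontier.task_done()

    for url in seed_urls:
        schedule(url)
    workers = [threading.Thread(target=spider) for _ in range(num_spiders)]
    for worker in workers:
        worker.start()
    frontier.join()            # returns once every scheduled URL is processed
    for _ in workers:
        frontier.put(None)     # one shutdown sentinel per spider
    for worker in workers:
        worker.join()
    return results
```

Because outlinks are scheduled before the parent URL is marked done, `frontier.join()` cannot return while discovered pages are still waiting. In a deployment closer to what the abstract describes, the queue and the seen-set would presumably live in the shared database so that spiders in separate containers could cooperate.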