DGWC: Distributed and generic web crawler for online information extraction


Abstract

Online information has become an important data source for analyzing public opinion and behavior, which is significant for social management and business decisions. Web crawler systems aim to automatically download and parse web pages to extract the expected online information. However, with the rapid increase in the number of web pages and the heterogeneity of page structures, performance and parsing rules have become two serious challenges for web crawler systems. In this paper, we propose a distributed and generic web crawler system (DGWC), in which spiders are scheduled to access and parse web pages in parallel to improve performance, using a shared, memory-based database. Furthermore, we package the spider program and its dependencies in a Docker container so that the system is easy to scale horizontally. Last but not least, a statistics-based approach is proposed to extract the main text using a supervised-learning classifier instead of parsing the page structures. Experimental results on real-world data validate the efficiency and effectiveness of DGWC.
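The statistics-based extraction idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which features or which classifier DGWC uses, so everything here is an assumption: the feature names (`length`, `link_density`, `punct_density`), the block tags, and the thresholds are hypothetical, and a hand-written rule stands in for the trained supervised classifier.

```python
import re
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "li", "td", "article", "section"}  # assumed block boundaries
SKIP_TAGS = {"script", "style"}


class BlockStats(HTMLParser):
    """Split a page into text blocks and count link characters per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished blocks: {"text": str, "link_chars": int}
        self._text = []         # text pieces of the block being built
        self._link_chars = 0    # characters seen inside <a> in the current block
        self._in_link = 0
        self._skip = 0

    def flush(self):
        text = re.sub(r"\s+", " ", "".join(self._text)).strip()
        if text:
            self.blocks.append({"text": text, "link_chars": self._link_chars})
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data)


def features(block):
    """Compute simple per-block statistics (hypothetical feature set)."""
    text = block["text"]
    n = len(text)
    return {
        "length": n,
        "link_density": min(1.0, block["link_chars"] / n),
        "punct_density": sum(text.count(c) for c in ".,;:!?") / n,
    }


def looks_like_main_text(f):
    # Stand-in for a trained classifier: hand-picked thresholds, illustration only.
    return f["length"] >= 80 and f["link_density"] < 0.3 and f["punct_density"] > 0.005


def extract_main_text(html):
    parser = BlockStats()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text that was not closed by a block tag
    return "\n".join(
        b["text"] for b in parser.blocks if looks_like_main_text(features(b))
    )
```

In a supervised setting, the dictionaries returned by `features` would be turned into labeled training vectors and `looks_like_main_text` would be replaced by the trained model's prediction. The point of the sketch is only that the decision depends on block statistics rather than on site-specific parsing rules.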


Bibliographic details

Source
International Conference on Behavioral, Economic, Socio – Cultural Computing | 2016 | 1 v. | 6 pages
Authors
Lu Zhang; Zhan Bu; Zhiang Wu; Jie Cao
Original format: PDF
Chinese Library Classification: Computing technology, computer technology
Keywords
Web pages; Crawlers; HTML; Feature extraction; Data mining; Containers; Uniform resource locators


Similar literature

1. A generic framework for extraction of knowledge from social web sources (social networking websites) for an online recommendation system [J]. Javubar Sathick, Jaya Venkat. International Review of Research in Open and Distributed Learning. 2015, No. 2
2. Silent geographical spread of the H7N9 virus by online knowledge analysis of the live bird trade with a distributed focused crawler [J]. Chen Chen, Shan Lu, Pengcheng Du. Emerging microbes & infections. 2013, No. 12
3. Application of Distributed Web Crawlers in Information Management System [J]. Bo Wen. Informatica: An International Journal of Computing and Informatics. 2018, No. 1
4. DGWC: Distributed and generic web crawler for online information extraction [C]. Lu Zhang, Zhan Bu, Zhiang Wu. Proceedings of 2016 International Conference on Behavioral, Economic, Socio – Cultural Computing. 2016
5. Extraction of ontology and semantic web information from online business reports [D]. Simmons, Lakisha L. 2011
6. A user-oriented web crawler for selectively acquiring online content in e-health research [O]. Songhua Xu, Hong-Jun Yoon, Georgia Tourassi
7. Dis-Dyn Crawler: A Distributed Crawler for Dynamic Web Page [O]. Jianfu Cai, Hua Zhang. 2015
