Implementation of a distributed web community crawler

Abstract

A web community is an important space for online users to exchange information, ideas and thoughts. Because of the collective intelligence of web communities, marketing and advertising activities have focused heavily on these sites. Although articles in web communities are open to the public, they cannot be easily collected and analyzed, because they are written in natural language and their formats are diverse. Though many web crawlers are available, they are poorly suited to gathering such articles. First, the URLs of web articles change frequently and are often redundant, which makes crawling difficult. Second, the number of articles is so large that the crawler must be designed in a scalable manner. Therefore, we propose a distributed web crawler optimized for collecting articles from popular communities. Our experiments show that our implementation achieves high throughput compared with the open-source crawler Nutch.
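The redundant-URL problem mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' method, which the abstract does not describe; it only shows one common way a crawler might map several URLs for the same article to a single key. The function names, the example host, and the `IGNORED_PARAMS` set are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical query parameters that do not identify an article (assumption).
IGNORED_PARAMS = {"sessionid", "sid", "ref"}


def canonicalize(url: str) -> str:
    """Return a normalized form of url so that redundant variants compare equal."""
    parts = urlsplit(url)
    # Drop ignored parameters and sort the rest so parameter order does not matter.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        urlencode(query),
        "",  # discard the fragment, which never reaches the server
    ))


def unique_articles(urls):
    """Keep the first canonical form of each article URL, preserving order."""
    seen = set()
    result = []
    for url in urls:
        key = canonicalize(url)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


if __name__ == "__main__":
    candidates = [
        "http://Forum.example.com/view?id=42&sessionid=abc",
        "http://forum.example.com/view?sessionid=xyz&id=42#top",
        "http://forum.example.com/view?id=43",
    ]
    for article in unique_articles(candidates):
        print(article)
```

In this sketch the first two candidate URLs collapse to the same key, so only two articles would be fetched. A real community site would need its own list of ignorable parameters.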

Bibliographic record

Source
Asia-Pacific Network Operations and Management Symposium | 2014 | pp. 1-6 | 6 pages
Authors
Seonyoung Park; Youngseok Lee
Original format: PDF
Keywords
Internet; information retrieval; public domain software; Nutch open-source crawler; Web document gathering; advertisement activity; collective intelligence; distributed Web community crawler; marketing activity; Communities; Crawlers; Linux; Throughput; Uniform resource locators; Web pages; Distributed web crawler; community; web forum;
