
Crawling Ranked Deep Web Data Sources



Abstract

In the era of big data, the vast majority of data does not come from the surface web, the web that is interconnected by hyperlinks and indexed by most general-purpose search engines. Instead, the trove of valuable data often resides in the deep web, the web that is hidden behind query interfaces. Because deep web data are often of high value, a line of research on crawling deep web data sources has developed over the past decade. However, most existing crawling methods assume that all matched documents are returned. In practice, many data sources rank the matched documents and return only the top k matches. When conventional methods are applied to such ranked data sources, popular queries that match more than k documents cause large redundancy. This paper proposes a document frequency (df) based algorithm that exploits the queries whose document frequencies fall within a specified range. The algorithm is extensively tested on a variety of datasets and compared with two existing algorithms. We demonstrate that our method outperforms the two algorithms by 58% and 90% on average, respectively.
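
The query-selection idea described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's algorithm, whose details are not given here; it only shows the general principle of keeping queries whose estimated document frequency stays within a range bounded by the return limit k. The names `document_frequencies`, `select_queries`, `min_df`, and `scale` are hypothetical, and the sample-based df estimate is an assumption made for illustration.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for doc in docs:
        # set() so a term repeated inside one document is counted once
        df.update(set(doc.lower().split()))
    return df


def select_queries(sample_docs, k, min_df=1, scale=1.0):
    """Return terms whose estimated df lies in [min_df, k].

    scale is a hypothetical ratio (source size / sample size) used to
    extrapolate the sample df to the whole data source.
    """
    df = document_frequencies(sample_docs)
    selected = [
        (term, count * scale)
        for term, count in df.items()
        if min_df <= count * scale <= k
    ]
    # Prefer terms that return the most documents without exceeding k;
    # ties are broken alphabetically to keep the order deterministic.
    selected.sort(key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in selected]


sample = [
    "deep web crawling",
    "deep web query selection",
    "ranked data sources",
    "deep learning",
]
queries = select_queries(sample, k=2)
print(queries)
```

In this toy sample the term "deep" appears in three documents, so with k = 2 it is skipped: a source that returns only the top 2 matches would truncate its result set, and issuing it would waste a query on documents that overlap with other results.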


Bibliographic information

Source
International Conference on Web Information Systems Engineering, 2015, 15 pages
Authors
Yan Wang; Yaxin Li; Nannan Pi; Jianguo Lu
Original format: PDF
Chinese Library Classification: TP391-532
Keywords
Deep web crawling; Query selection; Estimation; Document frequency; Return limit

Similar literature


1. Crawling ranked deep Web data sources [J]. Wang Yan, Lu Jianguo, Chen Jessica. World Wide Web, 2017, No. 1.
2. Selecting queries from sample to crawl deep web data sources [J]. Yan Wang, Jianguo Lu, Jie Liang. Web Intelligence and Agent Systems, 2012, No. 1.
3. OXPath: A language for scalable data extraction, automation, and crawling on the deep web [J]. Tim Furche, Georg Gottlob, Giovanni Grasso. The VLDB Journal, 2013, No. 1.
4. Crawling Ranked Deep Web Data Sources [C]. Yan Wang, Yaxin Li, Nannan Pi. International Conference on Web Information Systems Engineering, 2015.
5. Crawling the Web: Discovery and maintenance of large-scale Web data [D]. Cho, Junghoo. 2002.
6. An Efficient Approach for Web Indexing of Big Data through Hyperlinks in Web Crawling [O]. R. Suganya Devi, D. Manjula, R. K. Siddharth. 2015.
7. Deep web crawling for insights from polar data [O]. Siri Jodha S. Khalsa, Chris A. Mattmann, Ruth Duerr. 2017.
8. Focused Crawling of the Deep Web Using Service Class Descriptions [R]. Rocco, D., Liu, L., Critchlow, T. 2005.
