Optimal Algorithms for Crawling A Hidden Database in the Web

Abstract

A hidden database is a dataset that an organization makes accessible on the web by allowing users to issue queries through a search interface. In other words, data are not acquired from such a source by following static hyperlinks. Instead, data are obtained by querying the interface and reading the dynamically generated result page. This, together with other factors such as the interface possibly answering a query only partially, has prevented hidden databases from being crawled effectively by existing search engines. This paper remedies the problem by giving algorithms to extract all the tuples from a hidden database. Our algorithms are provably efficient: they accomplish the task by performing only a small number of queries, even in the worst case. We also establish theoretical results indicating that these algorithms are asymptotically optimal, i.e., it is impossible to improve their efficiency by more than a constant factor. The derivation of our upper and lower bound results reveals significant insight into the characteristics of the underlying problem. Extensive experiments confirm that the proposed techniques work very well on all the real datasets examined.
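
To make the setting concrete, the following minimal sketch simulates an interface that answers a query only partially and shows one simple way to recover every tuple from it. It is an illustration under stated assumptions, not the algorithm from the paper: it assumes a single integer attribute, an interface that returns at most k tuples per range query, and an interface that reports when a result was truncated. The names HiddenDB, query, and crawl are hypothetical.

```python
from typing import List, Tuple


class HiddenDB:
    """Toy stand-in for a search interface that returns at most k tuples per query.

    Hypothetical model: one integer attribute, range queries only.
    """

    def __init__(self, values: List[int], k: int):
        self._values = sorted(values)
        self.k = k
        self.queries = 0  # number of queries issued so far

    def query(self, lo: int, hi: int) -> Tuple[List[int], bool]:
        """Return up to k values in [lo, hi] and a flag that is True if the result was truncated."""
        self.queries += 1
        matches = [v for v in self._values if lo <= v <= hi]
        return matches[: self.k], len(matches) > self.k


def crawl(db: HiddenDB, lo: int, hi: int) -> List[int]:
    """Extract every value in [lo, hi] by splitting any range whose answer was truncated."""
    results: List[int] = []
    stack = [(lo, hi)]
    while stack:
        a, b = stack.pop()
        rows, truncated = db.query(a, b)
        if not truncated:
            # The interface returned everything in this range.
            results.extend(rows)
        elif a == b:
            # Assumption violated: more than k tuples share one value,
            # so no range query on this attribute can separate them.
            raise ValueError("more than k tuples share a single value")
        else:
            # Halve the range and query each half separately.
            mid = (a + b) // 2
            stack.append((a, mid))
            stack.append((mid + 1, b))
    return sorted(results)


if __name__ == "__main__":
    db = HiddenDB([3, 8, 15, 16, 23, 42, 57, 91], k=2)
    print(crawl(db, 0, 100))
    print("queries issued:", db.queries)
```

The number of queries issued is the cost that the abstract refers to; this naive halving strategy makes no claim to the worst-case optimality established in the paper.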

Bibliographic record

Source: International conference on very large data bases, 2012, pp. 1112-1123 (12 pages)
Authors: Cheng Sheng; Nan Zhang; Yufei Tao; Xin Jin
Original format: PDF
