Conference: International Conference on Very Large Data Bases

Optimal Algorithms for Crawling A Hidden Database in the Web



Abstract

A hidden database is a dataset that an organization makes accessible on the web by allowing users to issue queries through a search interface. In other words, data cannot be acquired from such a source by following static hyperlinks. Instead, data are obtained by querying the interface and reading the dynamically generated result page. This, together with other restrictions such as the interface answering a query only partially, has prevented hidden databases from being crawled effectively by existing search engines. This paper remedies the problem by giving algorithms to extract all the tuples from a hidden database. Our algorithms are provably efficient: they accomplish the task by performing only a small number of queries, even in the worst case. We also establish theoretical results indicating that these algorithms are asymptotically optimal, i.e., it is impossible to improve their efficiency by more than a constant factor. The derivation of our upper and lower bound results reveals significant insight into the characteristics of the underlying problem. Extensive experiments confirm that the proposed techniques work very well on all the real datasets examined.
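To make the problem setting concrete, the following is a minimal sketch of crawling through an interface that answers queries only partially. It is not the paper's algorithm and makes no claim to optimality: it uses plain binary range splitting on a single integer attribute. The `HiddenDB` class, its `query` method, the limit `k`, and the truncation flag are all hypothetical stand-ins invented for illustration, and the sketch assumes that no more than `k` tuples share the same value.

```python
from typing import List, Tuple


class HiddenDB:
    """Hypothetical stand-in for a search interface returning at most k tuples per query."""

    def __init__(self, values: List[int], k: int) -> None:
        self._values = sorted(values)
        self.k = k
        self.queries_issued = 0  # cost measure: number of queries sent

    def query(self, lo: int, hi: int) -> Tuple[List[int], bool]:
        """Return up to k values in [lo, hi] plus a flag saying the answer was truncated."""
        self.queries_issued += 1
        matches = [v for v in self._values if lo <= v <= hi]
        return matches[: self.k], len(matches) > self.k


def crawl(db: HiddenDB, lo: int, hi: int) -> List[int]:
    """Extract every value in [lo, hi] by halving any range whose answer is truncated."""
    result, truncated = db.query(lo, hi)
    if not truncated:
        return result  # the interface returned everything in this range
    if lo == hi:
        # Assumption violated: more than k tuples share one value.
        raise ValueError(f"more than k tuples share value {lo}; cannot split further")
    mid = (lo + hi) // 2
    return crawl(db, lo, mid) + crawl(db, mid + 1, hi)


# Illustrative data only; not taken from the paper.
db = HiddenDB([3, 8, 15, 16, 23, 42, 57, 91], k=3)
recovered = crawl(db, 0, 100)
```

After the crawl, `db.queries_issued` records how many queries were spent, which is the cost the abstract refers to when it speaks of performing only a small number of queries.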
