Crawling Ranked Deep Web Data Sources


Abstract

In the era of big data, the vast majority of data does not come from the surface web, the part of the web that is interconnected by hyperlinks and indexed by most general-purpose search engines. Instead, the trove of valuable data often resides in the deep web, the part of the web that is hidden behind query interfaces. Because deep web data is often of high value, a line of research on crawling deep web data sources has developed over the past decade. However, most existing crawling methods assume that all matched documents are returned. In practice, many data sources rank the matched documents and return only the top k matches. When conventional methods are applied to such ranked data sources, popular queries that match more than k documents cause large redundancy. This paper proposes a document frequency (df) based algorithm that exploits the queries whose document frequencies fall within a specified range. The algorithm is extensively tested on a variety of datasets and compared with two existing algorithms. We demonstrate that our method outperforms the two algorithms by 58% and 90% on average, respectively.
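The idea of keeping only queries whose document frequency falls within a range can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names (`document_frequencies`, `select_queries`), the whitespace tokenization, the toy sample, and the bounds `df_min` and `df_max` are all assumptions made for illustration. It also assumes that document frequencies are computed on a local sample of documents, and that the upper bound would be chosen in relation to the top-k limit of the data source.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for doc in docs:
        # Use a set so a term is counted at most once per document.
        df.update(set(doc.lower().split()))
    return df


def select_queries(docs, df_min, df_max):
    """Return terms whose document frequency lies in [df_min, df_max].

    Terms are ordered by descending document frequency, then alphabetically.
    Both bounds are hypothetical tuning parameters, not values from the paper.
    """
    df = document_frequencies(docs)
    candidates = [t for t, n in df.items() if df_min <= n <= df_max]
    return sorted(candidates, key=lambda t: (-df[t], t))


# Toy sample standing in for documents already retrieved from a data source.
sample = [
    "deep web crawling",
    "deep web data sources",
    "ranked data sources",
    "web query interfaces",
]

# Keep only terms that occur in exactly two sample documents.
queries = select_queries(sample, df_min=2, df_max=2)
print(queries)
```

In this toy sample the term "web" appears in three documents and is excluded by the upper bound, which mirrors the abstract's point that overly popular queries on a top-k source mostly return redundant results.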
