
Estimating the size of hidden data sources by queries


Abstract

The sizes of hidden data sources are of great interest to the public, researchers, and even business competitors. Estimating the size of a hidden data source remains a challenging problem. Most existing methods are derived from the classic capture-recapture method. Another approach is based on a large query pool, but it is inaccurate because the document frequencies of the queries in the pool vary widely. To address this problem, we propose a new method that constructs the query pool from a sample of the target data source, so that the document frequency variance is reduced while most of the documents are still covered. Our method is tested on a variety of large textual corpora and outperforms both the baseline random query method and the estimation method of Broder et al. on all the datasets.
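
To make the two families of estimators mentioned in the abstract concrete, here is a minimal Python sketch on a toy in-memory corpus. It is an illustration under stated assumptions, not the paper's proposed method: the function names `lincoln_petersen` and `pool_estimate`, the `docs` dictionary, and the single-term queries are all hypothetical, and the pool-based function follows the general weighting idea of pool-based estimation (each matched document counts as 1 divided by the number of pool queries it matches) rather than any specific published implementation.

```python
def lincoln_petersen(first_sample, second_sample):
    """Classic capture-recapture estimate: N ~ n1 * n2 / m.

    n1 and n2 are the numbers of distinct documents in the two samples,
    and m is the number of documents that appear in both.
    """
    first, second = set(first_sample), set(second_sample)
    overlap = len(first & second)
    if overlap == 0:
        raise ValueError("samples do not overlap; the estimate is undefined")
    return len(first) * len(second) / overlap


def pool_estimate(docs, pool, sampled_queries):
    """Estimate how many documents match at least one query in the pool.

    docs: dict mapping doc id -> set of terms (toy stand-in for the hidden
          source; a real source would only be reachable through its search box)
    pool: iterable of single-term queries
    sampled_queries: the queries actually issued; assumed to be drawn from pool
    """
    pool_set = set(pool)
    total = 0.0
    for query in sampled_queries:
        for terms in docs.values():
            if query in terms:
                # A document matched by k pool queries contributes 1/k per match,
                # so it counts exactly once when summed over the whole pool.
                total += 1.0 / len(terms & pool_set)
    # Scale the per-query average up to the full pool.
    return len(pool_set) * total / len(sampled_queries)


# Toy data (assumption): document 4 is not reachable through the pool.
docs = {1: {"a", "b"}, 2: {"b"}, 3: {"a", "c"}, 4: {"d"}}
pool = ["a", "b", "c"]

print(lincoln_petersen(range(1, 101), range(51, 151)))  # n1=100, n2=100, m=50
print(pool_estimate(docs, pool, pool))                  # all pool queries issued
print(pool_estimate(docs, pool, ["b"]))                 # one high-weight query
print(pool_estimate(docs, pool, ["c"]))                 # one low-weight query
```

The last two calls show why the spread of document frequencies matters: when only a few queries are sampled, the estimate swings with how many documents each query happens to match, which is the variance the abstract sets out to reduce.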
