Estimating the size of hidden data sources by queries


Abstract

The sizes of hidden data sources are of great interest to the public, researchers, and even business competitors. Estimating the size of a hidden data source remains a challenging problem. Most existing methods are derived from the classic capture-recapture method. Another approach is based on a large query pool, but it is inaccurate because the document frequencies of the queries in the pool have a large variance. To address this problem, we propose a new method that constructs the query pool from a sample of the target data source, so that the document frequency variance is reduced while most of the documents are still covered. Our method is tested on a variety of large textual corpora and outperforms both the baseline random query method and the estimation method of Broder et al. on all the datasets.
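
To make the role of document frequency variance concrete, the following is a minimal sketch of a generic pool-based size estimator. It is not the estimator from this paper, whose details the abstract does not give. It assumes single-term queries, a toy in-memory corpus standing in for the hidden source, and a weighting scheme in which each matched document contributes 1 divided by the number of pool queries that match it. The names `build_index`, `query_weight`, and `estimate_covered_size` are hypothetical and chosen for illustration only.

```python
import random
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in set(text.split()):
            index[term].add(doc_id)
    return index


def query_weight(query, index, pool):
    """Sum, over documents matched by `query`, of 1 / (pool queries matching the document).

    Against a real hidden source the count would come from inspecting each
    returned document; here the toy index plays that role (an assumption).
    """
    total = 0.0
    for doc_id in index.get(query, ()):
        matches = sum(1 for q in pool if doc_id in index.get(q, ()))
        total += 1.0 / matches
    return total


def estimate_covered_size(pool, index, n_queries, rng):
    """Estimate how many documents the pool covers from a uniform sample of queries."""
    sampled = [rng.choice(pool) for _ in range(n_queries)]
    mean_weight = sum(query_weight(q, index, pool) for q in sampled) / n_queries
    return len(pool) * mean_weight


# Toy corpus standing in for the hidden data source.
docs = ["apple banana", "banana cherry", "cherry date", "apple date egg", "fig"]
index = build_index(docs)

# Every query in this pool has the same weight, so the estimate has no variance.
uniform_pool = ["apple", "banana", "cherry", "date"]
# The weights in this pool differ per query, so a sampled estimate fluctuates.
skewed_pool = ["apple", "egg", "fig"]

print(estimate_covered_size(uniform_pool, index, 10, random.Random(0)))
print([query_weight(q, index, skewed_pool) for q in skewed_pool])
```

In this sketch the per-query weights sum to the number of documents the pool covers, so the estimate is exact when every pool query is used. With a sample of queries, the spread of the weights drives the error, which is the variance that a pool built from a sample of the target source is meant to reduce.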

Bibliographic record

Source
IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining | 2014 | pp. 712-719 | 8 pages
Authors
Wang Yan; Liang Jie; Lu Jianguo
Original format
PDF
Keywords
Dictionaries; Educational institutions; Estimation; Indexes; Measurement; Nickel; Hidden data source; document frequency; estimator; pool-based sampling

