When one sample is not enough


Abstract

Database selection is an important step when searching over large numbers of distributed text databases. The database selection task relies on statistical summaries of the database contents, which are not typically exported by databases. Previous research has developed algorithms for constructing an approximate content summary of a text database from a small document sample extracted via querying. Unfortunately, Zipf's law practically guarantees that content summaries built this way for any relatively large database will fail to cover many low-frequency words. Incomplete content summaries might negatively affect the database selection process, especially for short queries with infrequent words. To improve the coverage of approximate content summaries, we build on the observation that topically similar databases tend to have related vocabularies. Therefore, the approximate content summaries of topically related databases can complement each other and increase their coverage. Specifically, we exploit a (given or derived) hierarchical categorization of the databases and adapt the notion of "shrinkage" (a form of smoothing that has been used successfully for document classification) to the content summary construction task. A thorough evaluation over 315 real web databases as well as over TREC data suggests that the shrinkage-based content summaries are substantially more complete than their "unshrunk" counterparts. We also describe how to modify existing database selection algorithms to adaptively decide, at run time, whether to apply shrinkage for a query. Our experiments, which rely on TREC data sets, queries, and the associated "relevance judgments," show that our shrinkage-based approach significantly improves state-of-the-art database selection algorithms, and also outperforms a recently proposed hierarchical strategy that exploits database classification as well.
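To make the idea of shrinkage concrete, the sketch below mixes a database's sampled word probabilities with those of its ancestor categories in a topic hierarchy, so that words missing from the sample can still receive a nonzero estimate. It is only an illustration under stated assumptions: the function name `shrink_summary`, the toy summaries, and the fixed mixture weights are all hypothetical, and the abstract does not say how the weights are actually chosen.

```python
from collections import defaultdict


def shrink_summary(db_summary, ancestor_summaries, weights):
    """Return a smoothed ("shrunk") content summary.

    db_summary: dict mapping word -> probability estimated from the sample.
    ancestor_summaries: list of dicts (parent category first, root last).
    weights: one mixture weight for the database plus one per ancestor;
             assumed here to be given and to sum to 1 (illustrative only).
    """
    if len(weights) != len(ancestor_summaries) + 1:
        raise ValueError("need one weight per summary")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")

    shrunk = defaultdict(float)
    # Weighted sum of the database summary and each ancestor summary.
    for weight, summary in zip(weights, [db_summary] + ancestor_summaries):
        for word, prob in summary.items():
            shrunk[word] += weight * prob
    return dict(shrunk)


# Toy data (hypothetical): the sample missed "hypertension" entirely.
db = {"cancer": 0.6, "heart": 0.4}
health = {"cancer": 0.3, "heart": 0.3, "hypertension": 0.4}
root = {"cancer": 0.1, "heart": 0.1, "hypertension": 0.1, "football": 0.7}

shrunk = shrink_summary(db, [health, root], weights=[0.7, 0.2, 0.1])
```

In this toy case the shrunk summary assigns "hypertension" a small positive probability even though the sample never saw it, which is the coverage gain the abstract describes. Whether to apply shrinkage for a given query is a separate decision that this sketch does not model.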
