International Conference on String Processing and Information Retrieval

Compact Features for Detection of Near-Duplicates in Distributed Retrieval

Abstract

In distributed information retrieval, answers from separate collections are combined into a single result set. However, the collections may overlap. The fact that the collections are distributed means that it is not in general feasible to prune duplicate and near-duplicate documents at index time. In this paper we introduce and analyze the grainy hash vector, a compact document representation that can be used to efficiently prune duplicate and near-duplicate documents from result lists. We demonstrate that, for a modest bandwidth and computational cost, many near-duplicates can be accurately removed from result lists produced by a cooperative distributed information retrieval system.
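
The abstract does not say how the grainy hash vector is built, so the following is only a rough sketch of the general idea of a compact per-document feature that a broker can compare across result lists. It assumes a construction based on min-wise hashing over word shingles, with each minimum truncated to a few bits. The function names (`shingles`, `compact_vector`, `matches`, `prune`), the parameters `n`, `w`, `k`, and the match threshold are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (assumed tokenisation)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, shingle):
    """64-bit hash of a shingle under a given seed (illustrative choice)."""
    digest = hashlib.sha1(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def compact_vector(text, n=8, w=4, k=3):
    """Hypothetical compact feature: n min-hashes, each cut down to w bits."""
    grams = shingles(text, k)
    mask = (1 << w) - 1
    return tuple(min(_hash(seed, g) for g in grams) & mask for seed in range(n))


def matches(a, b):
    """Count the positions at which two vectors agree."""
    return sum(x == y for x, y in zip(a, b))


def prune(results, threshold=7, n=8, w=4, k=3):
    """Keep the first of each group of documents whose vectors nearly agree.

    `results` is a merged, ranked list of (doc_id, text) pairs. In a
    cooperative setting each collection would send the vector rather than
    the text; the text is used here only to keep the sketch self-contained.
    """
    kept, seen = [], []
    for doc_id, text in results:
        vec = compact_vector(text, n=n, w=w, k=k)
        if any(matches(vec, s) >= threshold for s in seen):
            continue  # treated as a duplicate or near-duplicate
        kept.append(doc_id)
        seen.append(vec)
    return kept
```

With these illustrative settings each document costs `n * w = 32` bits to transmit, which is the kind of bandwidth trade-off the abstract refers to; the accuracy of any particular setting would have to be measured rather than assumed.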