Conference proceedings: String Processing and Information Retrieval; Lecture Notes in Computer Science; 4209

Compact Features for Detection of Near-Duplicates in Distributed Retrieval

Abstract

In distributed information retrieval, answers from separate collections are combined into a single result set. However, the collections may overlap. The fact that the collections are distributed means that it is not in general feasible to prune duplicate and near-duplicate documents at index time. In this paper we introduce and analyze the grainy hash vector, a compact document representation that can be used to efficiently prune duplicate and near-duplicate documents from result lists. We demonstrate that, for a modest bandwidth and computational cost, many near-duplicates can be accurately removed from result lists produced by a cooperative distributed information retrieval system.
