Journal of the American Society for Information Science and Technology

Managing Deja Vu: Collection building for the identification of nonidentical duplicate documents

Abstract

As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. Few users wish to retrieve search results consisting of sets of duplicate documents, whether identical duplicates or close variants. The goal of this work is to facilitate (a) investigations into the phenomenon of near duplicates and (b) algorithmic approaches to minimizing its deleterious effect on search results. Harnessing the expertise of both client-users and professional searchers, we establish principled methods to generate a test collection for identifying and handling nonidentical duplicate documents. We subsequently examine a flexible method of characterizing and comparing documents to permit the identification of near duplicates. This method has produced promising results following an extensive evaluation using a production-based test collection created by domain experts.
