Venue: ACM International Conference on Information and Knowledge Management

CoDet: Sentence-based Containment Detection in News Corpora


Abstract

We study a generalized version of the near-duplicate detection problem, which asks whether a document is a subset of another document. In text-based applications, document containment can be observed in exact duplicates, near-duplicates, or containments, where the first two are special cases of the third. We introduce a novel method, called CoDet, which focuses particularly on this problem, and compare its performance with four well-known near-duplicate detection methods (DSC, full fingerprinting, I-Match, and SimHash) that are adapted to containment detection. Our method is extensible to different domains and is especially suitable for streaming news. Experimental results show that CoDet detects containments both effectively and efficiently.
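To make the notion of containment concrete, the following is a minimal sketch of a sentence-level containment score. It is not the CoDet algorithm, which the abstract does not describe; the function names `sentences` and `containment`, the regex-based sentence splitter, and the lowercase normalization are all assumptions chosen for illustration. The score is the fraction of sentences in one document that also occur in another, so an exact duplicate scores 1.0 in both directions, while a document that is a strict subset of another scores 1.0 in one direction only.

```python
import re


def sentences(text):
    """Split text into a set of normalized sentences.

    Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    Normalization lowercases and collapses runs of whitespace.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    normalized = (re.sub(r"\s+", " ", p).strip().lower() for p in parts)
    return {s for s in normalized if s}


def containment(a, b):
    """Return the fraction of the sentences of `a` that also occur in `b`."""
    sa, sb = sentences(a), sentences(b)
    if not sa:
        return 0.0  # an empty document is treated as not contained
    return len(sa & sb) / len(sa)


if __name__ == "__main__":
    short = "Markets fell today. Analysts blame rates."
    long = "Breaking news. Markets fell today. Analysts blame rates. More at nine."
    print(containment(short, long))  # short is fully contained in long
    print(containment(long, short))  # only half of long appears in short
```

The measure is asymmetric by design: `containment(a, b)` and `containment(b, a)` differ whenever one document is a proper subset of the other, which is what separates containment from symmetric near-duplicate similarity.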
