Journal: ACM SIGIR Forum

A Repetition Based Measure for Verification of Text Collections and for Text Categorization


Abstract

We suggest a way of locating duplicates and plagiarism in a text collection using an R-measure, which is the normalized sum of the lengths of all suffixes of the text repeated in other documents of the collection. The R-measure can be effectively computed using the suffix array data structure. Additionally, the computation procedure can be improved to locate the sets of duplicate or plagiarised documents. We applied the technique to several standard text collections and found that they contained a significant number of duplicate and plagiarised documents. Another reformulation of the method leads to an algorithm that can be applied to supervised multi-class categorization. We illustrate the approach using the recently available Reuters Corpus Volume 1 (RCV1). The results show that the method outperforms SVM at multi-class categorization and, interestingly, that the results correlate strongly with compression-based methods.
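The abstract gives no formula, so the following is only a minimal sketch of how such a measure might be computed. It makes several assumptions that are not stated in the text above: for each suffix of the document it takes the length of the longest prefix of that suffix occurring in any other document, sums these lengths, and divides by l(l+1)/2 (the largest possible sum for a text of length l) so that the value lies in [0, 1]. It also uses brute-force substring search in place of the suffix array mentioned in the abstract, which keeps the code short but makes it far too slow for real collections. The names `r_measure` and `classify` are hypothetical, and the categorization rule (pick the class whose training documents give the highest score) is a guess at the kind of reformulation the abstract alludes to.

```python
def r_measure(text, others):
    """Assumed R-measure: normalized sum, over all suffixes of `text`,
    of the longest suffix prefix that also occurs in some document in `others`.

    Brute-force illustration; the abstract's suffix array is NOT used here.
    """
    n = len(text)
    if n == 0:
        return 0.0
    total = 0
    for i in range(n):
        suffix = text[i:]
        k = 0
        # Grow the prefix one character at a time while it still occurs
        # somewhere in at least one of the other documents.
        while k < len(suffix) and any(suffix[:k + 1] in doc for doc in others):
            k += 1
        total += k
    # Assumed normalization: the maximum possible sum is n + (n-1) + ... + 1.
    return 2.0 * total / (n * (n + 1))


def classify(text, training_docs_by_class):
    """Hypothetical categorization rule: choose the class whose training
    documents share the most repeated material with `text`."""
    return max(
        training_docs_by_class,
        key=lambda label: r_measure(text, training_docs_by_class[label]),
    )


if __name__ == "__main__":
    print(r_measure("abc", ["abc"]))   # exact duplicate
    print(r_measure("abc", ["xyz"]))   # nothing shared
    corpus = {
        "finance": ["stock prices rose sharply"],
        "sport": ["the team won the match"],
    }
    print(classify("stock prices fell", corpus))
```

Under these assumptions an exact duplicate scores 1.0 and a document sharing no characters with the rest of the collection scores 0.0, so scores near 1 flag candidate duplicates or plagiarised texts.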
