A Repetition Based Measure for Verification of Text Collections and for Text Categorization

Dmitry V. Khmelev; William J. Teahan

Abstract

We suggest a way for locating duplicates and plagiarisms in a text collection using an R-measure, which is the normalized sum of the lengths of all suffixes of the text repeated in other documents of the collection. The R-measure can be effectively computed using the suffix array data structure. Additionally, the computation procedure can be improved to locate the sets of duplicate or plagiarised documents. We applied the technique to several standard text collections and found that they contained a significant number of duplicate and plagiarised documents. Another reformulation of the method leads to an algorithm that can be applied to supervised multi-class categorization. We illustrate the approach using the recently available Reuters Corpus Volume 1 (RCV1). The results show that the method outperforms SVM at multi-class categorization, and interestingly, that results correlate strongly with compression-based methods.
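
The abstract gives the R-measure only in words, so the following is a minimal sketch under stated assumptions rather than the authors' implementation. It assumes that "the lengths of all suffixes of the text repeated in other documents" means, for each suffix, the length of its longest prefix that occurs somewhere in the other documents, and that the sum is normalized by l(l+1)/2 (the largest value the sum can take for a text of length l), so the score lies in [0, 1]. It uses plain substring search instead of the suffix array mentioned in the abstract, which is simpler but much slower. The `classify` helper assumes the categorization variant picks the class whose training documents give the highest score. All function names are hypothetical.

```python
def longest_repeated_prefix(suffix, others):
    """Length of the longest prefix of `suffix` found in any document in `others`."""
    best = 0
    for doc in others:
        k = best
        # If a prefix of length k+1 occurs in doc, all shorter prefixes occur too,
        # so it is enough to try extending beyond the best length found so far.
        while k < len(suffix) and suffix[:k + 1] in doc:
            k += 1
        best = max(best, k)
    return best


def r_measure(text, others):
    """Assumed R-measure: normalized sum of repeated-prefix lengths over all suffixes."""
    l = len(text)
    if l == 0:
        return 0.0
    total = sum(longest_repeated_prefix(text[i:], others) for i in range(l))
    # Assumed normalization: l(l+1)/2 is the sum when the whole text is repeated.
    return 2.0 * total / (l * (l + 1))


def classify(text, training):
    """Assumed rule: choose the label whose training documents maximize the score.

    `training` maps a class label to a list of training documents (strings).
    """
    return max(training, key=lambda label: r_measure(text, training[label]))


if __name__ == "__main__":
    print(r_measure("abc", ["xxabcxx"]))  # text fully contained in another document
    print(r_measure("abc", ["xyz"]))      # no shared characters
    toy = {
        "sport": ["the match ended in a draw"],
        "finance": ["shares fell on the stock market"],
    }
    print(classify("stock market shares", toy))
```

Under these assumptions, a score near 1 flags a document that is almost entirely repeated elsewhere in the collection, which is the signal the abstract uses for duplicates and plagiarism.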

Bibliographic details

Source
ACM SIGIR FORUM | 2003, special issue | pp. 104-110 | 7 pages
Authors
Dmitry V. Khmelev; William J. Teahan
Author affiliations

Department of Mathematics, University of Toronto; Philological Department, Moscow State University

Indexing information
Original format: PDF
Language: English
Chinese Library Classification: Automation technology, computer technology
Keywords
text categorization; text compression; language modeling; cross-entropy

Similar literature

1. Text Document Categorization using Enhanced Sentence Vector Space Model and Bi-Gram Text Representation Model Based on Novel Fusion Techniques [J]. Abdisa Demissie Amensisa. New Media and Mass Communication. 2020, No. 4.
2. An enhanced text categorization method based on improved text frequency approach and mutual information algorithm [J]. Pei Zhili, Shi Xiaohu, Maurizio Marchese. Progress in Natural Science (English edition). 2007, No. 12.
3. An enhanced text categorization method based on improved text frequency approach and mutual information algorithm [J]. Progress in Natural Science (English edition). 2007, No. 12.
4. A Repetition Based Measure for Verification of Text Collections and for Text Categorization [C]. Dmitry V. Khmelev, William J. Teahan. The Twenty-Sixth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Jul 28-Aug 1, 2003, Toronto, Canada. 2003.
5. Vocabulary acquisition through reading: Assessing the lexical composition of theme-based text collections in upper-elementary education [D]. Gardner, Dee Isaac. 1999.
6. Biomedical Text Categorization Based on Ensemble Pruning and Optimized Topic Modelling [O]. Aytuğ Onan. 2018.
7. Verifying a Chinese Collection for Text Categorization [O]. Yuen-hsien Tseng, William John Teahan. 2004.
