Conference: International Conference on Asian Language Processing

Compiling a text re-use detection corpus from scientific papers with semi-real cases of plagiarism

Abstract

Automatic plagiarism detection deals with retrieving reused text fragments in a document and finding their source documents. Because many plagiarism detection methods have been developed, large-scale plagiarism corpora are needed to evaluate them. Despite their importance, few plagiarism detection corpora have been developed in recent years, especially for low-resource languages. Because of legal issues, releasing a collection of real cases of plagiarism for evaluation purposes is not ethical. Given these limitations, simulated and artificial methods are the two main approaches to compiling a plagiarism corpus. These approaches try to imitate real cases of plagiarism from different points of view. However, there are still fundamental differences between simulated corpora and real cases of plagiarism. In this paper, a semi-real approach is proposed for creating a collection of plagiarism cases as a corpus. The approach removes correct references from scientific papers so that the passages they credited become plagiarized passages. Unlike methods based on simulated and artificial approaches, the proposed corpus can correctly simulate real cases of text re-use. The evaluation results show high accuracy of the proposed corpus with respect to n-gram similarity for different ranges of N.
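The abstract does not describe the procedure in detail, so the following is only a minimal sketch of the two ideas it names: removing references from a passage and measuring n-gram similarity against a source. The citation pattern (bracketed numbers such as `[4]`), the function names, and the use of word n-gram containment as the similarity measure are all assumptions made for illustration, not the authors' method.

```python
import re

# Assumed citation style: bracketed numbers such as [3] or [1, 2-4].
# Real papers use many other styles; this pattern is illustrative only.
CITATION = re.compile(r"\s*\[\d+(?:\s*[,-]\s*\d+)*\]")


def strip_citations(text):
    """Remove bracketed numeric citation markers from a passage."""
    return CITATION.sub("", text)


def word_ngrams(text, n):
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(passage, source, n):
    """Fraction of the passage's word n-grams that also occur in the source."""
    passage_grams = word_ngrams(passage, n)
    if not passage_grams:
        return 0.0
    return len(passage_grams & word_ngrams(source, n)) / len(passage_grams)
```

Calling `containment(strip_citations(passage), source, n)` for several values of `n` gives one rough view of how closely a reference-stripped passage follows its source; larger `n` rewards only longer verbatim runs.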