
On the Mono- and Cross-Language Detection of Text Reuse and Plagiarism


Abstract

Plagiarism, the unacknowledged reuse of text, has increased in recent years due to the large amount of text readily available. For instance, recent studies claim that nowadays a high rate of student reports include plagiarism, making manual plagiarism detection practically infeasible. Automatic plagiarism detection tools assist experts in analysing documents for plagiarism. Nevertheless, the lack of standard collections with cases of plagiarism has prevented the accurate comparison of models, making differences hard to appreciate. Seminal efforts on the detection of text reuse [2] have fostered the composition of standard resources for the accurate evaluation and comparison of methods. The aim of this PhD thesis is to address three of the main problems in the development of better models for automatic plagiarism detection: (i) the adequate identification of good potential sources for a given suspicious text; (ii) the detection of plagiarism despite modifications, such as word substitution and paraphrasing (special stress is given to cross-language plagiarism); and (iii) the generation of standard collections of cases of plagiarism and text reuse in order to provide a framework for the accurate comparison of models. Regarding difficulties (i) and (ii), we have carried out preliminary experiments over the METER corpus [2]. Given a suspicious document dq and a collection of potential source documents D, the process is divided into two steps. First, a small subset of potential source documents D* ⊂ D is retrieved. The documents d ∈ D* are the most related to dq and, therefore, the most likely to include the source of the plagiarised fragments in it. We performed this stage on the basis of the Kullback-Leibler distance, computed over a sub-sample of the documents' vocabularies. Afterwards, a detailed analysis is carried out comparing dq to every d ∈ D* in order to identify potential cases of plagiarism and their sources.
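The retrieval step described above can be sketched as follows. This is an illustrative reconstruction, not the thesis's implementation: the smoothing constant, the restriction of the vocabulary to the suspicious document's terms, and all function names are assumptions.

```python
from collections import Counter
from math import log

def term_distribution(tokens):
    """Maximum-likelihood term probability distribution of a token sequence."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def kl_distance(p, q, vocab, eps=1e-9):
    """Kullback-Leibler divergence D(p || q) restricted to `vocab`.

    A small epsilon smooths zero probabilities so the divergence stays
    finite; the thesis's exact smoothing scheme is not given here.
    """
    return sum(p.get(t, eps) * log(p.get(t, eps) / q.get(t, eps))
               for t in vocab)

def retrieve_candidates(dq_tokens, corpus, k=2):
    """Return the k documents in `corpus` closest to dq (smallest KL distance).

    `corpus` maps document names to token lists. The vocabulary sub-sample
    here is simply the suspicious document's own terms (an assumption).
    """
    p = term_distribution(dq_tokens)
    vocab = set(dq_tokens)
    ranked = sorted(corpus.items(),
                    key=lambda kv: kl_distance(p, term_distribution(kv[1]), vocab))
    return [name for name, _ in ranked[:k]]
```

The subset D* returned by `retrieve_candidates` would then be passed on to the detailed n-gram comparison stage.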
This comparison was made on the basis of word n-grams, considering n = {2, 3}. These n-gram levels are flexible enough to properly retrieve plagiarised fragments and their sources despite modifications [1]. The result is offered to the user, who takes the final decision. Further experiments were carried out in both stages in order to compare other similarity measures, such as the cosine measure, the Jaccard coefficient, and diverse fingerprinting and probabilistic models. One of the main weaknesses of currently available models is that they are unable to detect cross-language plagiarism. Approaching the detection of this kind of plagiarism is highly relevant, as most of the information published is written in English, and authors writing in other languages may find it attractive to make use of direct translations. Our experiments, carried out over parallel and comparable corpora, show that "standard" cross-language information retrieval models are not enough. In fact, if the analysed source and target languages are related in some way (common linguistic ancestors or shared technical vocabulary), a simple comparison based on character n-grams seems to be a good option. However, in those cases where the relation between the languages involved is weaker, other models, such as those based on statistical machine translation, are necessary [3]. We plan to perform further experiments, mainly to approach the detection of cross-language plagiarism. In order to do that, we will use the corpora developed under the framework of the PAN competition on plagiarism detection (cf. PAN@CLEF). Models that consider cross-language thesauri and the comparison of cognates will also be applied.
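A minimal sketch of the comparison stage described above: word n-gram overlap measured with the Jaccard coefficient, plus character n-grams, which can be compared across related languages without translation. Averaging over the two n-gram levels and all names here are illustrative assumptions, not the thesis's exact method.

```python
def word_ngrams(tokens, n):
    """Set of word n-grams of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def char_ngrams(text, n=3):
    """Set of character n-grams; usable across related language pairs."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two n-gram sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def ngram_similarity(susp_tokens, src_tokens, ns=(2, 3)):
    """Average Jaccard overlap of word n-grams for n = 2 and n = 3."""
    return sum(jaccard(word_ngrams(susp_tokens, n),
                       word_ngrams(src_tokens, n)) for n in ns) / len(ns)
```

Fragments of dq scoring above some threshold against a fragment of a candidate source d ∈ D* would be flagged as potential plagiarism cases and offered to the user; the character n-gram variant serves the related-language cross-language scenario.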