
A study on plagiarism detection and plagiarism direction identification using natural language processing techniques



Abstract

Ever since we entered the digital communication era, the ease of information sharing through the internet has encouraged online literature searching. With this comes the potential risk of a rise in academic misconduct and intellectual property theft. As concerns over plagiarism grow, more attention has been directed towards automatic plagiarism detection. This is a computational approach which assists humans in judging whether pieces of texts are plagiarised. However, most existing plagiarism detection approaches are limited to superficial, brute-force string-matching techniques. If the text has undergone substantial semantic and syntactic changes, string-matching approaches do not perform well. In order to identify such changes, linguistic techniques which are able to perform a deeper analysis of the text are needed. To date, very limited research has been conducted on the topic of utilising linguistic techniques in plagiarism detection. This thesis provides novel perspectives on plagiarism detection and plagiarism direction identification tasks. The hypothesis is that original texts and rewritten texts exhibit significant but measurable differences, and that these differences can be captured through statistical and linguistic indicators. To investigate this hypothesis, four main research objectives are defined. First, a novel framework for plagiarism detection is proposed. It involves the use of Natural Language Processing techniques, rather than only relying on the traditional string-matching approaches. The objective is to investigate and evaluate the influence of text pre-processing, and statistical, shallow and deep linguistic techniques using a corpus-based approach. This is achieved by evaluating the techniques in two main experimental settings. Second, the role of machine learning in this novel framework is investigated. The objective is to determine whether the application of machine learning in the plagiarism detection task is helpful. This is achieved by comparing a threshold-setting approach against a supervised machine learning classifier. Third, the prospect of applying the proposed framework in a large-scale scenario is explored. The objective is to investigate the scalability of the proposed framework and algorithms. This is achieved by experimenting with a large-scale corpus in three stages. The first two stages are based on longer text lengths and the final stage is based on segments of texts. Finally, the plagiarism direction identification problem is explored as supervised machine learning classification and ranking tasks. Statistical and linguistic features are investigated individually or in various combinations. The objective is to introduce a new perspective on the traditional brute-force pair-wise comparison of texts. Instead of comparing original texts against rewritten texts, features are drawn based on traits of texts to build a pattern for original and rewritten texts. Thus, the classification or ranking task is to fit a piece of text into a pattern. The framework is tested by empirical experiments, and the results from initial experiments show that deep linguistic analysis contributes to solving the problems we address in this thesis. Further experiments show that combining shallow and deep techniques helps improve the classification of plagiarised texts by reducing the number of false negatives. In addition, the experiment on plagiarism direction detection shows that rewritten texts can be identified by statistical and linguistic traits. The conclusions of this study offer ideas for further research directions and potential applications to tackle the challenges that lie ahead in detecting text reuse.
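
To make the contrast between string matching and deeper analysis concrete, the following is a minimal sketch of a string-matching baseline with a fixed decision threshold. It is not the thesis's framework: the word-trigram containment measure, the function names (`word_ngrams`, `containment`, `is_flagged`), the threshold of 0.5, and the sample sentences are all assumptions chosen for illustration.

```python
import re


def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(suspicious, source, n=3):
    """Fraction of the suspicious text's n-grams that also occur in the source."""
    susp = word_ngrams(suspicious, n)
    if not susp:
        return 0.0
    return len(susp & word_ngrams(source, n)) / len(susp)


def is_flagged(suspicious, source, threshold=0.5, n=3):
    """Threshold-setting decision: flag the pair if the overlap is high enough.

    The threshold value is a hypothetical choice, not one taken from the thesis.
    """
    return containment(suspicious, source, n) >= threshold


# Illustrative inputs (invented for this sketch).
source = "the quick brown fox jumps over the lazy dog"
copied = "the quick brown fox jumps over the lazy dog"
paraphrased = "a fast brown fox leaps above a sleepy dog"

print(is_flagged(copied, source))       # verbatim copy shares every trigram
print(is_flagged(paraphrased, source))  # paraphrase shares no trigram
```

In this sketch the paraphrased sentence shares no word trigram with the source, so a purely surface-level measure misses it. That is the kind of case the abstract argues calls for linguistic analysis, and a supervised classifier would replace the fixed threshold with a decision learned from labelled examples.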
