Foreign-language conference: Text processing

A Fast Multi-level Plagiarism Detection Method Based on Document Embedding Representation

Abstract

Global networks now give easy access to vast amounts of textual information and, as a consequence, make plagiarism more feasible. Given the amount of text produced every day, the need for a fast, automated plagiarism detection system is greater than ever. Plagiarism detection is defined as the identification of reused text. Various algorithms have been proposed for detecting plagiarism in text documents. Because traditional plagiarism detection algorithms are limited in semantic representation and computationally inefficient, in this paper we propose an embedding-based document representation that detects plagiarism using a two-level decision-making approach. The method is language-independent and works on various languages. In the proposed method, words are represented as multi-dimensional vectors, and simple aggregation methods combine the word vectors to represent sentences. By comparing the representations of source and suspicious sentences, the sentence pairs with the highest similarity scores are taken as candidate plagiarism cases. The final decision on whether a pair is plagiarized is made at a second level of similarity calculation, which applies the Jaccard metric to the word sets of the two sentences. Our method was used in the PAN2016 Persian plagiarism detection contest and achieved 85.8% recall, 95.9% precision, and 90.6% plagdet, a score that combines these two measures with a measure of how concretely plagiarism cases are retrieved, on the provided data sets in a short amount of time. The method ranked second on plagdet and first on runtime.
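The two-level decision described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the aggregation method, the similarity function, the tokenizer, or any thresholds, so the following are all assumptions made for illustration: averaging as the aggregation, cosine as the first-level similarity, whitespace tokenization, the hypothetical names `TOY_VECTORS`, `sentence_vector`, and `detect`, the threshold values, and the hand-written three-dimensional vectors.

```python
import math

# Hand-written toy vectors (assumption): real systems would load trained embeddings.
TOY_VECTORS = {
    "cat": [1.0, 0.0, 0.0], "sat": [0.9, 0.1, 0.0], "mat": [0.8, 0.0, 0.2],
    "stocks": [0.0, 1.0, 0.0], "fell": [0.1, 0.9, 0.0],
    "sharply": [0.0, 0.8, 0.2], "today": [0.0, 0.5, 0.5],
    "rain": [0.0, 0.0, 1.0], "expected": [0.0, 0.2, 0.8],
    "tomorrow": [0.0, 0.3, 0.7],
}


def sentence_vector(tokens, vectors):
    """Average the vectors of in-vocabulary tokens (one simple aggregation)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity of the word sets of two sentences."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def detect(source, suspicious, vectors, cos_threshold=0.9, jaccard_threshold=0.5):
    """Return (suspicious_index, source_index) pairs that pass both levels.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    src_tokens = [s.lower().split() for s in source]
    src_vecs = [sentence_vector(t, vectors) for t in src_tokens]
    pairs = []
    for i, sentence in enumerate(suspicious):
        tokens = sentence.lower().split()
        vec = sentence_vector(tokens, vectors)
        if vec is None:
            continue
        # Level 1: pick the source sentence with the highest embedding similarity.
        best_j, best_sim = None, -1.0
        for j, src_vec in enumerate(src_vecs):
            if src_vec is None:
                continue
            sim = cosine(vec, src_vec)
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j is None or best_sim < cos_threshold:
            continue
        # Level 2: confirm the candidate with Jaccard similarity on word sets.
        if jaccard(tokens, src_tokens[best_j]) >= jaccard_threshold:
            pairs.append((i, best_j))
    return pairs


SOURCE = ["the cat sat on the mat", "stocks fell sharply today"]
SUSPICIOUS = [
    "the cat sat on a mat",                          # passes both levels
    "rain is expected tomorrow",                     # rejected at level 1
    "my cat sat quietly outside yesterday evening",  # rejected at level 2
]

print(detect(SOURCE, SUSPICIOUS, TOY_VECTORS))
```

In this toy run, the third suspicious sentence is close to the first source sentence in embedding space because only "cat" and "sat" are in the vocabulary, but the two sentences share few words, so the Jaccard check discards the pair. This is the role the abstract assigns to the second level: the embedding comparison proposes candidates and the word-set comparison makes the final decision.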
