Journal: Expert Systems with Applications

Cross-language text alignment: A proposed two-level matching scheme for plagiarism detection


Abstract

The exponential growth of documents in various languages across the web, together with the availability of editing and translation tools, has made cross-language plagiarism detection a challenging problem. Given its importance, this study focuses on cross-language text alignment, also known as detailed analysis, which operates on the output of the source retrieval step of cross-language plagiarism detection systems. The paper proposes a two-level matching approach that considers both syntactic and semantic information in order to accurately align plagiarized fragments between source and suspicious documents. At the first level, a vector space model that employs a dictionary based on multilingual word embeddings and a local weighting technique extracts a minimal set of highly promising candidate fragment pairs, rather than considering all possible pairs of fragments. This step also includes a dynamic expansion technique that covers more candidate pairs in order to improve the system's recall. At the second level, a more precise algorithm examines the candidate pairs at the sentence level using a graph-of-words representation of the text. By modelling both the words and their relationships, this level yields an acceptable increase in the system's precision, which is its goal. To identify evidence of plagiarism, i.e. potential cases of unauthorized text reuse, the algorithm searches for maximum cliques in the match graph of the source and suspicious texts. With this two-level investigation, the approach is able to discriminate true plagiarism cases from original text. Experimental results on datasets including PAN-PC-11, PAN-PC-12, and SemEval-2017 show that the proposed cross-language text alignment approach significantly outperforms state-of-the-art models and can be fed into an expert system to further improve cross-language plagiarism detection.
The source code is publicly available on GitHub for the purposes of reproducible research. (c) 2020 Elsevier Ltd. All rights reserved.
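
The two-level idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The toy dictionary `TOY_DICT`, the log-scaled term-frequency weighting, the `threshold` value, and all function names are assumptions made for illustration: the abstract does not specify the local weighting technique, the dynamic expansion step, or how the match graph is built. The sketch therefore filters candidate fragment pairs by cosine similarity in a shared vocabulary, and separately finds a maximum clique in a graph that is assumed to be given. The graph-of-words step is omitted.

```python
import math
from collections import Counter

# Hypothetical toy dictionary (German -> English). The paper derives its
# dictionary from multilingual word embeddings; that step is not reproduced.
TOY_DICT = {"katze": "cat", "sitzt": "sits", "matte": "mat",
            "hund": "dog", "bellt": "barks"}


def to_vector(tokens, dictionary=None):
    """Map tokens into a shared vocabulary and weight by log-scaled TF.

    The log-TF weighting is an assumed stand-in for the paper's
    unspecified local weighting technique.
    """
    if dictionary is not None:
        tokens = [dictionary[t] for t in tokens if t in dictionary]
    counts = Counter(tokens)
    return {term: 1.0 + math.log(tf) for term, tf in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def candidate_pairs(src_fragments, susp_fragments, threshold=0.5):
    """Level 1 (sketch): keep fragment pairs whose similarity >= threshold."""
    src_vecs = [to_vector(f.lower().split(), TOY_DICT) for f in src_fragments]
    susp_vecs = [to_vector(f.lower().split()) for f in susp_fragments]
    return [(i, j)
            for i, u in enumerate(src_vecs)
            for j, v in enumerate(susp_vecs)
            if cosine(u, v) >= threshold]


def maximum_clique(adj):
    """Largest clique of an undirected graph {node: set(neighbours)}.

    Uses Bron-Kerbosch with pivoting. How the match graph is constructed
    is not stated in the abstract, so the graph is taken as input.
    """
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = list(r)
            return
        pivot = max(p | x, key=lambda n: len(adj[n] & p))
        for n in list(p - adj[pivot]):
            expand(r + [n], p & adj[n], x & adj[n])
            p = p - {n}
            x = x | {n}

    expand([], set(adj), set())
    return best


if __name__ == "__main__":
    source = ["die katze sitzt auf der matte", "der hund bellt"]
    suspicious = ["the cat sits on the mat", "a dog barks loudly",
                  "stock prices fell"]
    print(candidate_pairs(source, suspicious))
    graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
    print(sorted(maximum_clique(graph)))
```

In this toy run, only the fragment pairs that share translated vocabulary survive the first filter, so a costlier sentence-level comparison would run on two pairs instead of six. Exhaustive clique search is exponential in the worst case, which is one reason a sketch like this would only be applied to the small graphs left after candidate filtering.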
