Cross-language text alignment: A proposed two-level matching scheme for plagiarism detection

Roostaee Meysam; Fakhrahmad Seyed Mostafa; Sadreddini Mohammad Hadi


Abstract

The exponential growth of documents in various languages across the web, along with the availability of numerous editing and translation tools, has made cross-language plagiarism detection a challenging problem. Given its importance, the present study focuses on the task of cross-language text alignment, also known as detailed analysis, which operates on the outputs of the source retrieval step of cross-language plagiarism detection systems. The paper proposes a two-level matching approach that considers both syntactic and semantic information in order to accurately align plagiarized fragments from the source and suspicious documents. At the first level, a vector space model that employs a dictionary based on multilingual word embeddings and a local weighting technique is used to extract a minimal set of highly promising candidate fragment pairs, rather than considering all possible pairs of fragments. This step also includes a dynamic expansion technique that covers more candidate pairs, with the aim of improving the system's recall. It is followed by a more precise algorithm that examines the candidate pairs at the sentence level using a graph-of-words representation of the text. As a result of modelling both the words and their relationships, an acceptable increase in the system's precision, which is the goal of the second level, is also observed. To identify evidence of plagiarism, i.e. potential cases of unauthorized text reuse, the algorithm searches for maximum cliques in the match graph of the source and suspicious texts. With this two-level investigation, the approach is able to discriminate true plagiarism cases from original text. Experimental results on several datasets, including PAN-PC-11, PAN-PC-12, and SemEval-2017, show that the proposed cross-language text alignment approach significantly outperforms state-of-the-art models and can be fed into an expert system to further improve cross-language plagiarism detection. The source code is publicly available on GitHub for the purposes of reproducible research. (c) 2020 Elsevier Ltd. All rights reserved.
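
The second-level idea described in the abstract (a graph-of-words per sentence, then a maximum clique in a match graph) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the authors' published code: the function names, the sliding-window definition of the graph-of-words, the rule that two matched word pairs are compatible when their words co-occur in both sentences, and the tiny hand-written bilingual dictionary that stands in for a dictionary derived from multilingual word embeddings.

```python
from itertools import combinations


def graph_of_words(tokens, window=2):
    """Undirected co-occurrence edges between distinct words that lie
    within `window` positions of each other (assumed definition)."""
    edges = set()
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != word:
                edges.add(frozenset((word, tokens[j])))
    return edges


def match_graph(src_tokens, susp_tokens, dictionary, window=2):
    """Nodes are (source word, suspicious word) pairs linked by the dictionary.
    Two nodes are adjacent when their words co-occur in BOTH sentences
    (hypothetical compatibility rule)."""
    src_edges = graph_of_words(src_tokens, window)
    susp_edges = graph_of_words(susp_tokens, window)
    nodes = [(s, t) for s in set(src_tokens) for t in set(susp_tokens)
             if t in dictionary.get(s, ())]
    adj = {node: set() for node in nodes}
    for a, b in combinations(nodes, 2):
        if (frozenset((a[0], b[0])) in src_edges
                and frozenset((a[1], b[1])) in susp_edges):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def max_clique(adj):
    """Largest clique found by Bron-Kerbosch enumeration with pivoting."""
    best = set()

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = set(r)
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return best


# Toy English/Spanish sentence pair with a hand-written dictionary.
src = ["students", "copy", "text", "online"]
susp = ["estudiantes", "copian", "texto", "ayer", "online"]
toy_dict = {"students": {"estudiantes"}, "copy": {"copian"},
            "text": {"texto"}, "online": {"online"}}

clique = max_clique(match_graph(src, susp, toy_dict))
print(sorted(clique))
# [('copy', 'copian'), ('students', 'estudiantes'), ('text', 'texto')]
```

In this toy pair, the inserted word "ayer" pushes "online" out of the co-occurrence window of "copian", so the pair ("online", "online") is compatible only with ("text", "texto") and stays outside the largest clique. A real system would score the clique (for example by its size relative to sentence length) before reporting a plagiarism case; that scoring step is not specified in the abstract and is left out here.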


Bibliographic details

Source
Expert Systems with Applications | 2020, No. 12 | 113718.1-113718.20 | 20 pages
Authors
Roostaee Meysam; Fakhrahmad Seyed Mostafa; Sadreddini Mohammad Hadi
Author affiliations
Shiraz Univ Sch Elect & Comp Engn Dept Comp Sci & Engn & IT Shiraz Iran (all three authors)
Original format: PDF
Language: English
Keywords
Plagiarism detection; Cross-language plagiarism; Text alignment; Graph-of-words representation; Multilingual word embeddings

Similar literature

1. An effective approach to candidate retrieval for cross-language plagiarism detection: A fusion of conceptual and keyword-based schemes [J]. Information Processing & Management. 2020, No. 2
2. Cross-lingual text alignment for fine-grained plagiarism detection [J]. Ehsan Nava, Shakery Azadeh, Tompa Frank Wm. Journal of Information Science. 2019, No. 4
3. On the Mono- and Cross-Language Detection of Text Reuse and Plagiarism [C]. Alberto Barron-Cedeno. 33rd annual international ACM SIGIR conference on research and development in information retrieval 2010. 2010
4. Mono- and Cross-Lingual Paraphrased Text Reuse and Extrinsic Plagiarism Detection [D]. Sharjeel, Muhammad. 2020
5. Mutual information-based template matching scheme for detection of breast masses: from mammography to digital breast tomosynthesis [O]. Maciej A Mazurowski, Joseph Y Lo, Brian P Harrawood
6. On the mono- and cross-language detection of text reuse and plagiarism [O]. Alberto Barrón-Cedeño. 2010
