Journal: Expert Systems with Applications

Text plagiarism classification using syntax based linguistic features


Abstract

The proposed work models document-level text plagiarism detection as a binary classification problem, where the task is to classify a given suspicious-source document pair as plagiarized or non-plagiarized. The objective is to explore the potency of syntax-based linguistic features, extracted using shallow natural language processing techniques, for the plagiarism classification task. Shallow syntactic features, viz., part-of-speech tags and chunks, are utilized after effective pre-processing and filtering to prune irrelevant information. The work further proposes modelling this classification phase as an intermediate stage, placed after candidate source retrieval and before exhaustive passage-level detection. A two-phase feature selection approach is proposed, which improves the effectiveness of classification by selecting an appropriate set of features as the input to machine-learning-based classifiers. The proposed approach is evaluated under smaller and larger test conditions, using the corpus of plagiarized short answers (PSA) and plagiarism instances collected from the PAN corpus, respectively. Under both test conditions, performance is evaluated using general as well as advanced classification metrics. Another main contribution of the current work is the analysis of the dependencies and impact of the extracted features with respect to the type and complexity of plagiarism imposed on the documents. The results of the proposed approach are compared with two state-of-the-art approaches and significantly outperform these baselines. This in turn reflects the cogency of syntactic linguistic features in document-level plagiarism classification, especially for instances close to manual or real plagiarism scenarios. (C) 2017 Elsevier Ltd. All rights reserved.
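To make the idea of syntax-based features for a suspicious-source pair concrete, here is a minimal sketch. It is not the paper's feature set: it assumes each document is already tagged as a list of (word, tag) tuples with Penn Treebank-style tags, and the function names, the `keep_tags` filter, and the multiset Jaccard overlap are all illustrative assumptions. The resulting feature dictionary could then be fed to any binary classifier.

```python
from collections import Counter


def tag_ngrams(tags, n):
    """Return the multiset of POS-tag n-grams in a tag sequence."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


def overlap(a, b):
    """Multiset Jaccard overlap in [0, 1]; 0.0 when both multisets are empty."""
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 0.0


def pair_features(suspicious, source, keep_tags=("NN", "VB", "JJ")):
    """Build a feature dict for one suspicious-source pair.

    Both arguments are lists of (word, tag) tuples. `keep_tags` is a
    hypothetical filter: only words whose tag starts with one of these
    prefixes are compared, standing in for the pruning of irrelevant
    information mentioned in the abstract.
    """
    s_tags = [tag for _, tag in suspicious]
    r_tags = [tag for _, tag in source]
    s_words = Counter(w.lower() for w, t in suspicious if t.startswith(keep_tags))
    r_words = Counter(w.lower() for w, t in source if t.startswith(keep_tags))
    return {
        "pos_bigram": overlap(tag_ngrams(s_tags, 2), tag_ngrams(r_tags, 2)),
        "pos_trigram": overlap(tag_ngrams(s_tags, 3), tag_ngrams(r_tags, 3)),
        "content_word": overlap(s_words, r_words),
    }


# Toy, hand-tagged inputs (assumed tags, not the output of a real tagger).
source_doc = [("The", "DT"), ("cat", "NN"), ("sat", "VBD")]
copied_doc = [("The", "DT"), ("cat", "NN"), ("sat", "VBD")]
other_doc = [("Quickly", "RB"), ("run", "VB"), ("home", "NN")]

same = pair_features(copied_doc, source_doc)
different = pair_features(other_doc, source_doc)
```

In this toy setup an exact copy scores the maximum on every feature, while an unrelated sentence with a different tag sequence scores zero. Real pairs would fall in between, which is where feature selection and a trained classifier become relevant.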
