Text plagiarism classification using syntax based linguistic features

Vani K.; Gupta Deepa

Abstract

The proposed work models document-level text plagiarism detection as a binary classification problem, where the task is to classify a given suspicious-source document pair as plagiarized or non-plagiarized. The objective is to explore the potency of syntax-based linguistic features, extracted using shallow natural language processing techniques, for the plagiarism classification task. Shallow syntactic features, viz., part-of-speech tags and chunks, are utilized after effective pre-processing and filtering to prune irrelevant information. The work further proposes modelling this classification phase as an intermediate stage, placed after candidate source retrieval and before exhaustive passage-level detection. A two-phase feature selection approach is proposed, which improves the effectiveness of classification by selecting an appropriate set of features as input to machine-learning-based classifiers. The proposed approach is evaluated under smaller and larger test conditions, using the corpus of plagiarized short answers (PSA) and plagiarism instances collected from the PAN corpus, respectively. Under both test conditions, performance is evaluated using general as well as advanced classification metrics. Another main contribution of the current work is the analysis of the dependencies and impact of the extracted features with respect to the type and complexity of plagiarism imposed in the documents. The results of the proposed approach are compared with two state-of-the-art approaches and significantly outperform these baselines. This in turn reflects the cogency of syntactic linguistic features in document-level plagiarism classification, especially for instances close to manual or real plagiarism scenarios. (C) 2017 Elsevier Ltd. All rights reserved.
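As a rough illustration of the idea in the abstract, and not the authors' implementation, the sketch below turns a suspicious-source pair into a small feature vector of part-of-speech n-gram overlaps and applies a toy decision rule. Everything named here is an assumption made for illustration: the inputs are taken to be already POS-tagged, the helper names (`ngrams`, `overlap`, `pair_features`, `classify`) are hypothetical, the multiset Jaccard measure is a stand-in for whatever features the paper actually uses, and the threshold rule stands in for a trained machine learning classifier. Chunk features and the two-phase feature selection are not modelled.

```python
from collections import Counter


def ngrams(tags, n):
    """Count contiguous n-grams over a sequence of POS tags."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


def overlap(a, b):
    """Multiset Jaccard similarity between two n-gram Counters."""
    inter = sum((a & b).values())  # element-wise minimum of counts
    union = sum((a | b).values())  # element-wise maximum of counts
    return inter / union if union else 0.0


def pair_features(susp_tags, src_tags, max_n=3):
    """Feature vector for one pair: POS n-gram overlap for n = 1..max_n."""
    return [
        overlap(ngrams(susp_tags, n), ngrams(src_tags, n))
        for n in range(1, max_n + 1)
    ]


def classify(features, threshold=0.5):
    """Toy rule (assumption): mean overlap above threshold means plagiarized."""
    mean = sum(features) / len(features)
    return "plagiarized" if mean >= threshold else "non-plagiarized"


# Hypothetical pre-tagged documents (Penn Treebank-style tags).
susp = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
src = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
other = ["PRP", "VBD", "IN", "NNP"]

print(classify(pair_features(susp, src)))
print(classify(pair_features(susp, other)))
```

In a realistic setting the feature vectors would be fed to a trained classifier after feature selection, and the tag sequences would come from a POS tagger and chunker run on pre-processed text.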

Bibliographic details

Source
Expert Systems with Applications, 2017, Issue 12, pp. 448-464 (17 pages)
Authors
Vani K.; Gupta Deepa
Author affiliations

Amrita Univ, Amrita Vishwa Vidyapeetham, Amrita Sch Engn, Dept Comp Sci & Engn, Bengaluru, India;

Amrita Univ, Amrita Vishwa Vidyapeetham, Amrita Sch Engn, Dept Math, Bengaluru, India;

Original format: PDF
Language: English
Keywords
Plagiarism classification; Syntactic features; Linguistic features; POS tags; Chunks;
