Three Way Search Engine Queries with Multi-feature Document Comparison for Plagiarism Detection

Abstract

In this paper, we describe our approach to the PAN 2012 plagiarism detection competition. Our candidate retrieval system extracts three different types of Web queries and narrows their execution by skipping certain passages of the input document. The queries are based on keyword extraction, intrinsic plagiarism detection, and header extraction. We also compare the performance of the constructed queries used during the PAN 2012 test process. The proposed methodology was the best-performing one in long-term operation and also the most cost-effective one. Our detailed comparison system detects common features of several types in the input document pair; in the final submission we used two types of features: sorted word 5-grams and unsorted stop word 8-grams. We propose a method of computing so-called valid intervals from those features, which are represented by their offset and length attributes in both the source and the suspicious document. Previous works use the feature ordering as the measure of distance, which is not usable for multiple types of features, since these have no natural ordering. From the valid intervals we compute the final detections in a post-processing phase, where we merge neighbouring valid intervals and remove some types of overlapping detections. We further discuss other approaches that we explored but did not use in our final submission. We also discuss the performance aspects of our program, parameter settings, and the relevance of the current PAN 2012 rules (including the plagdet score) to real-world plagiarism detection systems.
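
The two feature types named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the tokenization rule, and the tiny stop word list are all assumptions made for illustration. Each feature is stored with its character offset and length, and a common feature is any n-gram key that occurs in both the suspicious and the source document.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stop word list (the paper's list is not given here).
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "was", "it", "for", "with", "on"}


def tokenize(text):
    """Return (lowercased word, start offset, end offset) triples."""
    return [(m.group().lower(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def features(text, n, sort_words, stop_only):
    """Map each n-gram key to the (offset, length) spans where it occurs."""
    tokens = tokenize(text)
    if stop_only:
        # Keep only stop words, so n-grams are built over the stop word sequence.
        tokens = [t for t in tokens if t[0] in STOP_WORDS]
    spans = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        words = [w for w, _, _ in window]
        # Sorting the words makes the feature insensitive to word order.
        key = tuple(sorted(words)) if sort_words else tuple(words)
        start, end = window[0][1], window[-1][2]
        spans[key].append((start, end - start))
    return spans


def common_features(suspicious, source, n, sort_words, stop_only):
    """List (key, suspicious span, source span) for n-grams present in both texts."""
    susp = features(suspicious, n, sort_words, stop_only)
    src = features(source, n, sort_words, stop_only)
    return [(key, s, t)
            for key in susp.keys() & src.keys()
            for s in susp[key]
            for t in src[key]]


# Sorted word 5-grams: n=5, sort_words=True, stop_only=False.
# Unsorted stop word 8-grams: n=8, sort_words=False, stop_only=True.
matches = common_features(
    "the quick brown fox jumps over the lazy dog",
    "brown quick the fox jumps high",
    n=5, sort_words=True, stop_only=False,
)
```

Because both feature types reduce to (offset, length) pairs in the two documents, later steps such as computing valid intervals can work on character positions rather than on the ordering of any one feature type, which is the motivation the abstract gives.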
