Journal: Computing and Informatics

EVALUATION AND IMPLEMENTATION OF N-GRAM-BASED ALGORITHM FOR FAST TEXT COMPARISON

Abstract

This paper presents a study of an n-gram-based document comparison method. The method is intended for building a large-scale plagiarism detection system. The work focuses not only on the efficiency of text similarity extraction but also on the execution performance of the implemented algorithms. We considered the detection performance, storage requirements, and execution time of the proposed approach. The obtained results show the trade-offs between detection quality and computational requirements. GPGPU and multi-CPU platforms were considered for implementing the algorithms and achieving good execution speed. The method consists of two main algorithms: document feature extraction and fast text comparison. The winnowing algorithm is used to generate a compressed representation of the analyzed documents. The authors designed and implemented a dedicated test framework for the algorithm, which allowed its parameters to be tuned, evaluated, and optimized. Well-known metrics (e.g. precision, recall) were used to evaluate detection performance. The authors conducted tests to determine the performance of the winnowing algorithm on obfuscated and unobfuscated texts for different window and n-gram sizes. In addition, a simplified version of the text comparison algorithm was proposed and evaluated in order to reduce the computational complexity of the text comparison process. The paper also presents GPGPU and multi-CPU implementations of the algorithms for different data structures. The implementation speed was tested for different algorithm parameters and data sizes. The scalability of the algorithm on multi-CPU platforms was verified. The authors provide a repository of the software tools and programs used to perform the experiments.
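The two stages named in the abstract, feature extraction by winnowing and fingerprint comparison, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`normalize`, `ngram_hashes`, `winnow`, `similarity`), the ASCII-only normalization, the truncated MD5 hash, the default values `n=5` and `w=4`, and the use of Jaccard similarity as the comparison measure are all assumptions made for illustration.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase the text and keep only ASCII letters and digits (an assumption)."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def ngram_hashes(text: str, n: int) -> list[int]:
    """Hash every character n-gram of the normalized text to a 32-bit integer."""
    s = normalize(text)
    return [
        int.from_bytes(hashlib.md5(s[i:i + n].encode()).digest()[:4], "big")
        for i in range(len(s) - n + 1)
    ]


def winnow(hashes: list[int], w: int) -> set[tuple[int, int]]:
    """Select the minimum hash in each window of w consecutive hashes.

    Ties are broken by taking the rightmost minimum. Each fingerprint is a
    (hash, position) pair, so a minimum shared by overlapping windows is
    stored only once.
    """
    fingerprints: set[tuple[int, int]] = set()
    for start in range(max(len(hashes) - w + 1, 0)):
        window = hashes[start:start + w]
        smallest = min(window)
        # Index of the rightmost occurrence of the minimum inside the window.
        offset = len(window) - 1 - window[::-1].index(smallest)
        fingerprints.add((smallest, start + offset))
    return fingerprints


def similarity(a: str, b: str, n: int = 5, w: int = 4) -> float:
    """Jaccard similarity of the two documents' fingerprint hash sets."""
    fa = {h for h, _ in winnow(ngram_hashes(a, n), w)}
    fb = {h for h, _ in winnow(ngram_hashes(b, n), w)}
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


if __name__ == "__main__":
    original = "The winnowing algorithm generates a compressed representation."
    copied = "As noted, the winnowing algorithm generates a compressed representation!"
    print(similarity(original, copied))
```

Larger values of `w` keep fewer fingerprints per document, which reduces storage and comparison time at the cost of missing shorter matches; this is the kind of trade-off between detection quality and computational requirements that the abstract describes.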
