A Global Dictionary Based Approach to Fast Similar Text Search in Document Repository

Abstract

Text plagiarism is growing rapidly with the development of the Internet, and many plagiarism detection algorithms have been proposed in response. However, most algorithms focus on optimized one-to-one comparison rather than massive document comparison. These algorithms are limited in time performance when users conduct an exhaustive search over a huge set of documents. In this paper, we propose an optimized preprocessing model to detect similar text in massive document repositories. This model uses an efficient data structure called GDIC (Global Dictionary) for preprocessing. After filtering stop words, we choose the pairs of documents to be inspected using two methods at the same time. Both methods rely on the concept of a common non-stop word, but each uses it in a slightly different way. The first method chooses pairs of documents that share a high frequency of common non-stop words, while the second chooses pairs with a high proportion of common non-stop words. We demonstrate the performance of the model experimentally. In our experiments, the proposed preprocessing model drastically reduced searching time to 64~87%, while the sensitivity stood at 77~96%. When this model is used, GDIC generation time accounts for a large proportion of the total detection time. In future work, we will optimize GDIC creation time to improve the performance of the entire system.
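
The pair-selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify GDIC's internal layout or the selection thresholds, so everything here is an assumption: the global dictionary is modeled as a simple inverted index (word to set of document ids), "high frequency" is read as a count of shared distinct non-stop words, "high proportion" is read as that count divided by the smaller document's vocabulary size, and the names `STOP_WORDS`, `build_global_dictionary`, `candidate_pairs`, `min_common`, and `min_ratio` are hypothetical. It is not the authors' implementation.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical, deliberately tiny stop-word list (not taken from the paper).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def non_stop_words(text):
    """Return the set of distinct lowercase non-stop words in a text."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}


def build_global_dictionary(docs):
    """Map each non-stop word to the ids of the documents containing it.

    Assumption: this inverted index stands in for GDIC.
    """
    gdic = defaultdict(set)
    for doc_id, text in docs.items():
        for word in non_stop_words(text):
            gdic[word].add(doc_id)
    return gdic


def candidate_pairs(docs, min_common=3, min_ratio=0.5):
    """Select document pairs worth a detailed comparison.

    A pair is kept if it passes either test (both thresholds are assumed):
      1. count test: at least min_common shared non-stop words;
      2. ratio test: shared words / smaller vocabulary >= min_ratio.
    """
    gdic = build_global_dictionary(docs)
    vocab_size = {d: len(non_stop_words(t)) for d, t in docs.items()}

    # Count shared words per pair by walking the dictionary once,
    # so pairs with no word in common are never touched.
    common = defaultdict(int)
    for doc_ids in gdic.values():
        for pair in combinations(sorted(doc_ids), 2):
            common[pair] += 1

    selected = set()
    for (a, b), n in common.items():
        smaller = min(vocab_size[a], vocab_size[b])
        if n >= min_common or n / smaller >= min_ratio:
            selected.add((a, b))
    return selected


docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "a quick brown fox leaps over a lazy cat",
    "d3": "stock markets fell sharply on monday",
}
pairs = candidate_pairs(docs)
```

In this sketch only the pairs that survive the filter would be handed to a slower one-to-one comparison, which is where a preprocessing step of this kind would save time. Building the dictionary touches every word of every document, which is consistent with the abstract's remark that GDIC generation dominates the total detection time.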

Bibliographic details

Source
11th IEEE International Conference on Computer and Information Technology | 2011 | pp. 526-532 | 7 pages
Authors
Park Sun-Young; Kim Seon Yeong; Kim Sung-Hwan; Cho Hwan-Gue
Original format: PDF
CLC classification: Computing technology, computer technology
Keywords
dictionary; information retrieval; plagiarism; text similarity

Similar literature

1. Augmenting Medical Decision Making With Text-Based Search of Teaching File Repositories and Medical Ontologies: Text-Based Search of Radiology Teaching Files [J]. Priya Deshpande, Alexander Rasin, Eli T Brown. International journal of knowledge discovery in bioinformatics. 2018, No. 2
2. Automatic Folder Allocation System for Electronic Text Document Repositories Using Enhanced Bayesian Classification Approach [J]. Wou Onn Choo, Lam Hong Lee, Yen Pei Tay. International Journal of Intelligent Information Technologies. 2019, No. 2
3. Text Document Retrieval In English Using Keywords of Indonesian Dictionary Based [J]. Jati Sasongko Wibowo, Sri Hartati. Indonesian Journal of Computing and Cybernetics Systems. 2011, No. 1
4. Fast Globally Optimal Search in Tree-Structured Dictionaries [C]. Yan Huang, Ilya Pollak, Minh N. Do. Wavelets XI. 2005
5. Semantic search and information retrieval techniques for text repositories [D]. Singh, Lisham Lekhendro. 2012
6. Thematic clustering of text documents using an EM-based approach [O]. Sun Kim, W John Wilbur. 2012
7. TM-SGTD: Text Mining Based Semantic Graph for Text Document Approach for Text Representation [O]. Ashish Pacharne, Pramod S Nair, Srinivasa Rao D. 2017
