Conference: 11th IEEE International Conference on Computer and Information Technology

A Global Dictionary Based Approach to Fast Similar Text Search in Document Repository

Abstract

Text plagiarism is growing rapidly with the development of the Internet, and many plagiarism detection algorithms have been proposed in response. However, most algorithms focus on optimized one-to-one comparison rather than massive document comparison. When users conduct an exhaustive search over a huge set of documents, time performance becomes a limitation. In this paper, we propose an optimized preprocessing model to detect similar text in massive document repositories. The model uses an efficient data structure called GDIC (Global Dictionary) for preprocessing. After filtering out stop words, we select the document pairs to be inspected using two methods simultaneously. Both rely on the concept of a common non-stop word, but each applies it in a slightly different way. The first method selects pairs of documents with a high frequency of common non-stop words, while the second selects pairs with a high proportion of common non-stop words. We evaluate the performance of the model experimentally. In our experiments, the proposed preprocessing model drastically reduced searching time to 64~87%, while the sensitivity stands at 77~96%. With this model, GDIC generation time accounts for a large proportion of the total detection time. In future work, we will optimize GDIC creation time to improve the performance of the entire system.
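
The pair-selection idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the layout of GDIC or the exact scoring rules, so everything here is an assumption made for illustration: the dictionary is modeled as an inverted index from each non-stop word to per-document counts, "frequency" is taken as the summed minimum occurrence count of shared words, "proportion" is the number of shared distinct words divided by the smaller vocabulary, and a pair is kept when either test passes. The names `STOP_WORDS`, `build_gdic`, `candidate_pairs`, `min_freq`, and `min_ratio` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "of", "in", "is", "and", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_gdic(docs):
    """Assumed GDIC shape: non-stop word -> {doc_id: occurrence count}."""
    gdic = defaultdict(dict)
    for doc_id, text in docs.items():
        for word, count in Counter(tokenize(text)).items():
            gdic[word][doc_id] = count
    return gdic


def candidate_pairs(docs, gdic, min_freq=3, min_ratio=0.5):
    """Select document pairs worth a detailed comparison.

    Method 1 (assumed): total frequency of common non-stop words >= min_freq.
    Method 2 (assumed): share of common distinct non-stop words >= min_ratio.
    A pair is kept if either method selects it (also an assumption).
    """
    shared_freq = Counter()   # summed min occurrence count of shared words
    shared_words = Counter()  # number of distinct shared words
    for postings in gdic.values():
        for a, b in combinations(sorted(postings), 2):
            shared_freq[(a, b)] += min(postings[a], postings[b])
            shared_words[(a, b)] += 1

    # Distinct non-stop vocabulary size per document.
    vocab = {doc_id: len(set(tokenize(text))) for doc_id, text in docs.items()}

    selected = set()
    for (a, b), freq in shared_freq.items():
        ratio = shared_words[(a, b)] / min(vocab[a], vocab[b])
        if freq >= min_freq or ratio >= min_ratio:
            selected.add((a, b))
    return selected


docs = {
    "d1": "global dictionary speeds similar text search",
    "d2": "global dictionary speeds plagiarism search",
    "d3": "cats sleep all day",
}
pairs = candidate_pairs(docs, build_gdic(docs))
```

In this toy repository only the pair `("d1", "d2")` shares any non-stop words, so it is the only pair passed on for detailed comparison; `d3` is never compared against anything. Skipping pairs with no shared vocabulary is what lets a preprocessing step of this kind reduce search time, at the cost of building the dictionary up front.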