International Journal of Information Technology and Computer Science

An Efficient String Matching Technique for Desktop Search to Detect Duplicate Files

Abstract

Information retrieval identifies the documents in a collection that match a user's query; it also refers to the automatic retrieval of documents from a large corpus. Its most prominent application is the search engine, such as Google, which identifies the documents on the World Wide Web that are relevant to user queries. Users often download files that they have already downloaded and stored on their computer. As a result, multiple copies of the same file can accumulate in different drives and folders, which reduces system performance and wastes storage space. Analyzing file contents and measuring their similarity is one of the major problems in text mining and information retrieval. The main objective of this research is to analyze file contents and delete duplicate files in the system. To this end, the work proposes a new tool named the Duplicate File Detector Tool (DFDT). DFDT helps the user search for and delete duplicate files in minimal time. It handles duplicates not only within the same file category but also across different file categories. Two existing string searching algorithms, Boyer-Moore-Horspool and Knuth-Morris-Pratt, are used to compare file contents and determine their similarity. This work also proposes a new string matching algorithm named W2COM (Word to Word COMparison). The experimental results show that the proposed W2COM algorithm performs better than the Boyer-Moore-Horspool and Knuth-Morris-Pratt algorithms.
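For readers unfamiliar with the two baseline algorithms named in the abstract, the following is a minimal Python sketch of textbook Boyer-Moore-Horspool and Knuth-Morris-Pratt substring search. It is not taken from the paper: the function names `horspool_search` and `kmp_search` are illustrative assumptions, and the abstract does not say how DFDT applies these algorithms to files or how W2COM works, so neither is shown here.

```python
def horspool_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1.

    Textbook Boyer-Moore-Horspool; an illustrative sketch, not the paper's code.
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Bad-character table: for each character in pattern[:-1], the distance
    # from its rightmost occurrence to the last position of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # Shift by the table entry for the text character aligned with the
        # end of the pattern; unseen characters allow a full-length shift.
        pos += shift.get(text[pos + m - 1], m)
    return -1


def kmp_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1.

    Textbook Knuth-Morris-Pratt; an illustrative sketch, not the paper's code.
    """
    m = len(pattern)
    if m == 0:
        return 0
    # fail[i] = length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of it.
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text once, reusing the failure table on mismatches.
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == m:
            return i - m + 1
    return -1
```

Both functions return the same index for the same inputs. They differ in how far they can skip after a mismatch: Horspool uses the text character aligned with the pattern's last position, while KMP uses the pattern's own prefix structure and never re-reads a text character.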
