International Conference on Algorithms and Architectures for Parallel Processing

Identifying File Similarity in Large Data Sets by Modulo File Length

Abstract

Identifying file similarity is very important for data management. Sampling files is a simple and effective approach to identifying file similarity. However, the traditional sampling algorithm (TSA) is very sensitive to file modification: for example, a single bit shift causes similarity detection to fail. Many research efforts have been invested in solving or alleviating this problem. This paper proposes a Position-Aware Sampling (PAS) algorithm to identify file similarity in large data sets by modulo file length. This method is very effective in dealing with file modification when performing similarity detection. Comprehensive experimental results demonstrate that PAS significantly outperforms a well-known similarity detection algorithm called simhash in terms of precision and recall. Furthermore, the time overhead and the CPU and memory usage of PAS are much lower than those of simhash.
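
The shift sensitivity described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how TSA chooses its samples, so the code assumes a simple scheme: hash fixed-length chunks at fixed-stride offsets and compare the hashes position by position. The names `sample_fingerprint` and `similarity`, the stride, and the sample length are all hypothetical, and the example inserts a whole byte rather than a single bit. It is not the paper's TSA or PAS implementation.

```python
import hashlib


def sample_fingerprint(data: bytes, stride: int = 256, sample_len: int = 8) -> list[str]:
    """Hash fixed-length chunks taken at fixed offsets 0, stride, 2*stride, ...

    Assumed TSA-like scheme for illustration only.
    """
    offsets = range(0, max(len(data) - sample_len + 1, 1), stride)
    return [hashlib.sha256(data[o:o + sample_len]).hexdigest() for o in offsets]


def similarity(a: bytes, b: bytes) -> float:
    """Fraction of sample positions whose chunk hashes match."""
    fa, fb = sample_fingerprint(a), sample_fingerprint(b)
    matches = sum(x == y for x, y in zip(fa, fb))
    return matches / max(len(fa), len(fb))


# 1024 bytes in which neighbouring bytes always differ.
original = bytes(i % 251 for i in range(1024))
shifted = b"\x00" + original    # one byte inserted at the front
appended = original + b"\x00"   # one byte added at the end

print(similarity(original, original))  # identical files
print(similarity(original, shifted))   # every sampled chunk is displaced
print(similarity(original, appended))  # sampled offsets are unaffected
```

Under these assumptions, inserting one byte at the front displaces every sampled chunk, so no sample hashes match even though the two files share almost all of their content. Appending a byte at the end leaves every sampled chunk in place. This is the kind of modification that a position-aware scheme such as PAS is meant to tolerate.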