Identifying File Similarity in Large Data Sets by Modulo File Length

Abstract

Identifying file similarity is very important for data management. Sampling files is a simple and effective approach to identifying file similarity. However, the traditional sampling algorithm (TSA) is very sensitive to file modification. For example, a single bit shift would result in a failure of similarity detection. Many research efforts have been invested in solving or alleviating this problem. This paper proposes a Position-Aware Sampling (PAS) algorithm to identify file similarity in large data sets by modulo file length. This method is very effective in dealing with file modification when performing similarity detection. Comprehensive experimental results demonstrate that PAS significantly outperforms a well-known similarity detection algorithm called simhash in terms of precision and recall. Furthermore, the time overhead and the CPU and memory usage of PAS are much lower than those of simhash.
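
The sensitivity of fixed-position sampling to a shift can be illustrated with a small sketch. The abstract does not describe how PAS uses the file length modulo, so the code below is not the paper's algorithm. It only contrasts sampling at offsets counted from the start of the file with a hypothetical variant that also counts half of its offsets from the end of the file. The names `BLOCK`, `N`, `head_offsets`, `head_tail_offsets`, and `similarity` are assumptions made for this illustration.

```python
import hashlib
import random
from typing import Callable, List, Set

BLOCK = 16  # bytes per sampled block (illustrative value, not from the paper)
N = 8       # number of samples per file (illustrative value, not from the paper)


def fingerprints(data: bytes, offsets: List[int]) -> Set[str]:
    """Hash the BLOCK-byte chunk found at each sampled offset."""
    return {hashlib.md5(data[o:o + BLOCK]).hexdigest() for o in offsets}


def head_offsets(length: int) -> List[int]:
    """Fixed-position sampling: every offset is counted from the start."""
    step = length // N
    return [i * step for i in range(N)]


def head_tail_offsets(length: int) -> List[int]:
    """Hypothetical position-aware variant (an assumption, not PAS itself):
    half of the offsets are counted from the start, half from the end."""
    step = length // N
    head = [i * step for i in range(N // 2)]
    tail = [length - BLOCK - i * step for i in range(N // 2)]
    return head + tail


def similarity(a: bytes, b: bytes, offsets_fn: Callable[[int], List[int]]) -> float:
    """Fraction of the N sampled fingerprints shared by the two files."""
    fa = fingerprints(a, offsets_fn(len(a)))
    fb = fingerprints(b, offsets_fn(len(b)))
    return len(fa & fb) / N


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
shifted_front = b"X" + original                      # one byte inserted at the start
shifted_mid = original[:2048] + b"X" + original[2048:]  # one byte inserted in the middle

print(similarity(original, shifted_front, head_offsets))       # 0.0
print(similarity(original, shifted_front, head_tail_offsets))  # 0.5
print(similarity(original, shifted_mid, head_offsets))         # 0.5
print(similarity(original, shifted_mid, head_tail_offsets))    # 1.0
```

With start-anchored offsets, a single inserted byte shifts every later block, so none of the sampled fingerprints after the insertion point match. Anchoring some samples to the end of the file keeps the blocks behind the insertion point aligned, which is one simple way to make sampling less sensitive to position shifts.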

Bibliographic details

Source: International Conference on Algorithms and Architectures for Parallel Processing | 2014 | pp. 136-149 | 14 pages
Authors: Yongtao Zhou; Yuhui Deng; Xiaoguang Chen; Junjie Xie
Original format: PDF
Keywords: file similarity; large data sets; position shifted; simhash

Similar literature

1. De-identifying a public use microdata file from the Canadian national discharge abstract database [J]. Khaled El Emam, David Paton, Fida Dankar. BMC Medical Informatics and Decision Making. 2011, No. 1
2. File similarity evaluation scheme for multimedia data using partial hash information [J]. Kim Byung-Kwan, Oh Su-Jin, Jang Sung-Bong. Multimedia Tools and Applications. 2017, No. 19
3. Similarity-based data mining in files of two-dimensional chemical structures using fingerprint measures of molecular resemblance [J]. Willett Peter. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2011, No. 3
4. Identifying File Similarity in Large Data Sets by Modulo File Length [C]. Yongtao Zhou, Yuhui Deng, Xiaoguang Chen. ICA3PP 2014. 2014
5. CORRESPONDENCE FILING--A CRITICAL ANALYSIS OF THE CONTENTS OF PUBLISHED PRACTICE SETS USED IN COLLEGE TEACHING WITH PARTICULAR REFERENCE TO FILING RULES USED IN OFFICES [D]. JOHNSON, MINA MARIE
6. De-identifying a public use microdata file from the Canadian national discharge abstract database [O]. Khaled El Emam, David Paton, Fida Dankar. 2011
7. Figure 4: (A) One conserved sequence, which occurs 79 times in 46,264 binding site peaks from the ChIP-seq data-set. The mutation profile of this conserved sequence is illustrated, where '_' indicates this base is unchanged; DEL indicates this base is lost; INS X indicates a new base X is inserted in front of this base. (B) Several repeated element patterns are listed. (C) In the first column, the top five DNA motifs mined by meme-chip tools (Machanick & Bailey, 2011) are illustrated. The resemblant conserved sequences found by the CFSP algorithm are listed in the second column. In the third column, the position-specific scoring matrices, which are transformed from mutational information, are listed. The similarity between each meme motif and the resemblant conserved sequence in PSSM format was calculated via a stamp motif comparison tool (Mahony & Benos, 2007). The E-values for the similarity of those pairs are displayed in the fourth column. (D) One motif is selected in each group clustered by gkmsvm descriptors, and the corresponding motif found by the CFSP algorithm is listed below. (E) There are additional datasets (File No: ENCFF100GRL, ENCFF616IRT, ENCFF870CER, Target: SREBF1) collected from https://www.encodeproject.org. The top two motifs are selected in each file using meme tools, and the corresponding motifs found by our algorithm are listed below. [O]
8. Addition of Work Rate and Temperature Information to the Augmented NMRI Standard (ANS) Data Files in the NMR198 Subset of the USN N2-O2 Primary Data Set [R]. D. J. Doolette, K. A. Gault, W. A. Gerth. 2011
