EPAS: A Sampling Based Similarity Identification Algorithm for the Cloud

Yongtao Zhou; Yuhui Deng; Junjie Xie; Laurence T. Yang

Abstract

The explosive growth of data brings new challenges to data storage and management in cloud environments. These data usually have to be processed in a timely fashion in the cloud, so any increase in latency may cause massive losses to enterprises. Similarity detection plays a very important role in data management. Many typical algorithms, such as Shingle, Simhash, Traits and the Traditional Sampling Algorithm (TSA), are extensively used. The Shingle, Simhash and Traits algorithms read the entire source file to calculate the corresponding similarity characteristic value, thus requiring many CPU cycles and much memory space and incurring a large number of disk accesses. In addition, the overhead increases as the data set grows and results in a long delay. Instead of reading the entire file, TSA samples some data blocks and calculates their fingerprints as the similarity characteristic value. The overhead of TSA is fixed and negligible. However, a slight modification of a source file shifts the bit positions of the file content, so similarity identification inevitably fails even when the modification is slight. This paper proposes an Enhanced Position-Aware Sampling algorithm (EPAS) that identifies file similarity for the cloud based on a modulo operation on the file length. EPAS concurrently samples data blocks from the head and the tail of the modulated file to avoid the position shift incurred by the modifications. Meanwhile, an improved metric is proposed to measure the similarity between different files and to bring the possible detection probability close to the actual probability. Furthermore, this paper describes a query algorithm that reduces the time overhead of similarity detection. Our experimental results demonstrate that EPAS significantly outperforms the existing well-known algorithms in terms of time overhead and CPU and memory occupation. Moreover, EPAS achieves a better tradeoff between precision and recall than other similarity detection algorithms. Therefore, it is an effective approach to similarity identification for the cloud.
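
The position-shift problem and the head-and-tail idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the modulo step, the block size, the sampling interval, or the improved similarity metric, so the sketch omits the modulo step and uses assumed values and a simple matched-fraction metric. The names `sample_head_only`, `sample_head_tail` and `similarity` are hypothetical.

```python
import hashlib


def _fp(chunk: bytes) -> str:
    # Fingerprint of one sampled block (MD5 is an assumption, not from the paper).
    return hashlib.md5(chunk).hexdigest()


def sample_head_only(data: bytes, n: int, block: int, stride: int) -> list[str]:
    """TSA-style sampling: blocks at fixed offsets counted from the file head."""
    return [_fp(data[i * stride:i * stride + block])
            for i in range(n) if i * stride + block <= len(data)]


def sample_head_tail(data: bytes, n_per_side: int, block: int, stride: int) -> list[str]:
    """Position-aware sampling: half the blocks are anchored at the head,
    half at the tail, so a single insertion shifts neither group."""
    head = sample_head_only(data, n_per_side, block, stride)
    tail = []
    for i in range(n_per_side):
        end = len(data) - i * stride  # offset counted back from the file end
        if end - block < 0:
            break
        tail.append(_fp(data[end - block:end]))
    return head + tail


def similarity(a: list[str], b: list[str]) -> float:
    """Illustrative metric: fraction of sample positions with equal fingerprints."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


# Deterministic, non-repeating 1024-byte test content.
original = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(32))
# A slight modification: 8 bytes inserted near the middle of the file.
modified = original[:500] + b"INSERTED" + original[500:]

head_only_score = similarity(sample_head_only(original, 8, 8, 128),
                             sample_head_only(modified, 8, 8, 128))
head_tail_score = similarity(sample_head_tail(original, 4, 8, 128),
                             sample_head_tail(modified, 4, 8, 128))
print(head_only_score, head_tail_score)
```

With these assumed parameters, head-only sampling matches only the four blocks that precede the insertion (score 0.5), while head-and-tail sampling matches all eight blocks (score 1.0). Both variants read the same number of blocks, so the sampling cost stays fixed.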

Bibliographic details

Source: Cloud Computing, IEEE Transactions on, 2018, Issue 3, pp. 720-733 (14 pages)
Authors: Yongtao Zhou; Yuhui Deng; Junjie Xie; Laurence T. Yang
Affiliations:

Department of Computer Science, Jinan University, Guangzhou, P.R. China;

Department of Computer Science, Jinan University, Guangzhou, China;

Department of Computer Science, Jinan University, Guangzhou, P.R. China;

Department of Computer Science, St. Francis Xavier University, Antigonish, NS, Canada;

Original format: PDF
Language: English
Keywords: Cloud computing; Bandwidth; Google; Web pages; Explosives; Delays
