Journal: IEEE Transactions on Cloud Computing

EPAS: A Sampling Based Similarity Identification Algorithm for the Cloud

Abstract

The explosive growth of data brings new challenges to data storage and management in cloud environments. These data usually have to be processed in the cloud in a timely fashion, so any increase in latency may cause massive losses to enterprises. Similarity detection plays a very important role in data management, and typical algorithms such as Shingle, Simhash, Traits, and the Traditional Sampling Algorithm (TSA) are used extensively. The Shingle, Simhash, and Traits algorithms read the entire source file to calculate the corresponding similarity characteristic value, which requires many CPU cycles and much memory and incurs a large number of disk accesses. In addition, the overhead grows with the volume of the data set and results in long delays. Instead of reading the entire file, TSA samples some data blocks and calculates their fingerprints as the similarity characteristic value. The overhead of TSA is fixed and negligible. However, even a slight modification of a source file shifts the bit positions of the file content, so similarity identification inevitably fails after such modifications. This paper proposes an Enhanced Position-Aware Sampling algorithm (EPAS) that identifies file similarity for the cloud by applying a modulo operation to the file length. EPAS concurrently samples data blocks from the head and the tail of the modulated file to avoid the position shift incurred by the modifications. Meanwhile, an improved metric is proposed to measure the similarity between different files and to bring the possible detection probability close to the actual probability. Furthermore, this paper describes a query algorithm that reduces the time overhead of similarity detection. Our experimental results demonstrate that EPAS significantly outperforms the existing well-known algorithms in terms of time overhead and CPU and memory occupation. Moreover, EPAS achieves a better tradeoff between precision and recall than other similarity detection algorithms. Therefore, it is an effective approach to similarity identification for the cloud.
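
The position-shift problem described in the abstract can be illustrated with a small Python sketch. This is not the EPAS algorithm from the paper: the block size, the stride, the MD5 fingerprint, the helper names (`sample_head`, `sample_tail`, `similarity`), and the simple matching ratio used as a score are all assumptions chosen for illustration, and the modulo step on the file length is not modeled. The sketch only compares sampling at fixed offsets from the head of a file with sampling from both the head and the tail after a few bytes are inserted in the middle.

```python
import hashlib
import random

BLOCK_SIZE = 64   # bytes per sampled block (assumed value)
STRIDE = 8192     # distance between sampled blocks (assumed value)


def fingerprint(block: bytes) -> str:
    """Fingerprint of one sampled block (MD5 is an assumption here)."""
    return hashlib.md5(block).hexdigest()


def sample_head(data: bytes, count: int) -> set:
    """Sample up to `count` blocks at fixed offsets measured from the start."""
    samples = set()
    for i in range(count):
        start = i * STRIDE
        if start + BLOCK_SIZE > len(data):
            break
        samples.add(("head", i, fingerprint(data[start:start + BLOCK_SIZE])))
    return samples


def sample_tail(data: bytes, count: int) -> set:
    """Sample up to `count` blocks at fixed offsets measured from the end."""
    samples = set()
    for i in range(count):
        start = len(data) - BLOCK_SIZE - i * STRIDE
        if start < 0:
            break
        samples.add(("tail", i, fingerprint(data[start:start + BLOCK_SIZE])))
    return samples


def similarity(a: set, b: set) -> float:
    """Fraction of matching samples; a stand-in, not the paper's metric."""
    if not a or not b:
        return 0.0
    return len(a & b) / max(len(a), len(b))


# Deterministic pseudo-random 64 KiB "file" and a copy with 10 bytes inserted.
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(65536))
modified = original[:30000] + b"0123456789" + original[30000:]

# Head-only sampling: every block after the insertion point is shifted.
head_only = similarity(sample_head(original, 8), sample_head(modified, 8))

# Head-and-tail sampling: tail offsets are measured from the end of the file,
# so blocks after the insertion point still line up.
head_tail = similarity(
    sample_head(original, 4) | sample_tail(original, 4),
    sample_head(modified, 4) | sample_tail(modified, 4),
)

print("head-only:", head_only)
print("head+tail:", head_tail)
```

In this setup the head-only score drops because the four blocks sampled after the insertion point no longer match, while the head-and-tail score is unaffected. An insertion that falls inside a sampled block would still change that block's fingerprint under either scheme.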
