Journal: IEEE Transactions on Computers

ApproxSSD: Data Layout Aware Sampling on an Array of SSDs

Abstract

Running analytics frameworks on sampled data sets is a current trend in response to growing data sizes and the demand for real-time analysis. In addition, high-performance, energy-efficient Solid-State Drive (SSD) arrays are the primary storage subsystem for parallel data analysis systems. To exploit the benefits of SSD arrays when running analytics on sampled data, several key issues must be addressed. First, because of logical-to-physical address translation, random data selection in sampling jobs can cause unbalanced workloads among the SSDs in the array. Second, once the data have been selected, existing task schedulers in data analysis frameworks can introduce non-negligible resource contention resulting from suboptimal Input/Output (I/O). In addition, SSD performance is unpredictable because maintenance costs vary at runtime, which makes the devices hard for the scheduler to manage. Given the trend towards sampled-data analytics and the use of SSDs, it is increasingly important to keep workloads balanced and to minimize resource contention. Unless these issues are addressed, sampled-data analytics on SSDs will continue to suffer from performance inefficiencies. In this paper, we propose ApproxSSD, a framework that performs on-disk layout-aware data sampling on SSD arrays. The framework combines data selection and task scheduling to improve the performance of many applications. ApproxSSD decouples I/O from computation in task execution, which avoids potential I/O contention and suboptimal workload balance. We have developed an open-source prototype of ApproxSSD in Scala, available on GitHub. Our evaluation shows that, compared to Spark, ApproxSSD achieves a speedup of up to 2.7 times at a 10 percent sampling ratio under an example sampling workload, while maintaining high output accuracy.
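
The workload-imbalance issue described in the abstract can be illustrated with a small Python sketch. This is not the ApproxSSD algorithm or its prototype code; it only contrasts uniform random block selection with a simple per-device stratified selection. The round-robin striping rule (`device_of`) and all function names are assumptions made for this illustration.

```python
import random
from collections import Counter


def device_of(block_id, num_ssds):
    # Assumed layout (hypothetical): round-robin striping,
    # so block i is stored on SSD i % num_ssds.
    return block_id % num_ssds


def uniform_sample(num_blocks, ratio, rng):
    # Layout-oblivious sampling: pick blocks uniformly at random.
    k = round(num_blocks * ratio)
    return rng.sample(range(num_blocks), k)


def layout_aware_sample(num_blocks, num_ssds, ratio, rng):
    # Stratified sampling by device: every SSD serves nearly the same
    # number of sampled blocks, so no single device becomes a hotspot.
    k = round(num_blocks * ratio)
    per_device = [[] for _ in range(num_ssds)]
    for block in range(num_blocks):
        per_device[device_of(block, num_ssds)].append(block)

    base, extra = divmod(k, num_ssds)
    # Devices that receive one extra block are chosen at random.
    lucky = set(rng.sample(range(num_ssds), extra))

    chosen = []
    for dev, blocks in enumerate(per_device):
        quota = base + (1 if dev in lucky else 0)
        chosen.extend(rng.sample(blocks, min(quota, len(blocks))))
    return chosen


def max_load(sample, num_ssds):
    # Number of sampled blocks on the busiest SSD.
    counts = Counter(device_of(b, num_ssds) for b in sample)
    return max(counts.get(dev, 0) for dev in range(num_ssds))


if __name__ == "__main__":
    rng = random.Random(0)
    plain = uniform_sample(1000, 0.1, rng)
    aware = layout_aware_sample(1000, 8, 0.1, rng)
    print("busiest SSD, uniform sampling:     ", max_load(plain, 8))
    print("busiest SSD, layout-aware sampling:", max_load(aware, 8))
```

With 1000 blocks, 8 SSDs, and a 10 percent ratio, the stratified variant never assigns more than 13 blocks to one device, whereas uniform selection can assign more. A real system would also have to account for the logical-to-physical mapping and runtime maintenance costs that the abstract mentions, which this sketch ignores.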
