Published in: Parallel and Distributed Systems, IEEE Transactions on

Supporting HPC Analytics Applications with Access Patterns Using Data Restructuring and Data-Centric Scheduling Techniques in MapReduce


Abstract

High Performance Computing (HPC) applications have seen explosive growth in data size in recent years. Many application scientists have begun integrating data-intensive computing into compute-intensive HPC facilities, particularly for data analytics. We have observed several scientific applications that must migrate their data from an HPC storage system to a data-intensive one for analytics. Because there is a gap between the data semantics of HPC storage and those of data-intensive systems, the migrated data must be further refined and reorganized before existing data-intensive tools such as MapReduce can be used to analyze it. This reorganization requires at least two complete scans through the data set and then at least one MapReduce program to prepare the data before analyzing it. Running multiple MapReduce phases causes significant overhead for the application in the form of excessive I/O operations: every MapReduce phase must perform a distributed read and write operation on the file system. Our contribution is a MapReduce-based framework for HPC analytics that eliminates the multiple scans and reduces the number of data preprocessing MapReduce programs. We also implement a data-centric scheduler that further improves the performance of HPC analytics MapReduce programs by maintaining data locality. We have added expressiveness to the MapReduce language so that application scientists can specify the logical semantics of their data such that 1) the data can be analyzed without running multiple data preprocessing MapReduce programs, and 2) the data can be reorganized as it is migrated to the data-intensive file system. Using our augmented MapReduce system, MapReduce with Access Patterns (MRAP), we have demonstrated up to 33 percent throughput improvement in one real application, and up to 70 percent in an I/O kernel of another application. Our results for scheduling show up to 49 percent improvement for an I/O kernel of a prevalent HPC analysis application.
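The abstract does not show how an access pattern is written down, so the following is only an illustrative sketch and not the MRAP interface. It assumes one simple case: a flat file in which fixed-size pieces belonging to several logical records are interleaved round-robin, as can happen when several processes write to a shared file. The function name `restructure_single_pass` and its parameters `piece_len` and `num_records` are hypothetical. The point it illustrates is that, once the layout is known, the pieces of each logical record can be gathered into contiguous form during a single sequential scan rather than in separate preprocessing passes.

```python
from typing import List


def restructure_single_pass(data: bytes, piece_len: int, num_records: int) -> List[bytes]:
    """Gather round-robin interleaved pieces into contiguous logical records.

    Hypothetical helper for illustration only (not part of MRAP).
    Piece k of the file is assumed to belong to record k % num_records.
    """
    if piece_len <= 0 or num_records <= 0:
        raise ValueError("piece_len and num_records must be positive")

    # One output buffer per logical record.
    buffers = [bytearray() for _ in range(num_records)]

    # Walk the input once, front to back, appending each piece to its record.
    for piece_no, begin in enumerate(range(0, len(data), piece_len)):
        buffers[piece_no % num_records] += data[begin:begin + piece_len]

    return [bytes(buf) for buf in buffers]


# Three logical records (A, B, C), each written as two 2-byte pieces.
interleaved = b"A1B1C1A2B2C2"
records = restructure_single_pass(interleaved, piece_len=2, num_records=3)
print(records)  # [b'A1A2', b'B1B2', b'C1C2']
```

In this toy layout each output record could be handed to a map task directly, which is the effect the abstract attributes to specifying the logical semantics of the data up front.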
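The abstract states that the data-centric scheduler maintains data locality but does not describe its policy. As a minimal sketch under that gap, the code below assumes a simple greedy rule: each task is placed on the node that already stores the most of the chunks the task reads, among nodes that still have a free slot. The function `schedule` and the inputs `task_chunks`, `chunk_locations`, and `slots_per_node` are hypothetical names chosen for this example and should not be read as the scheduler used in MRAP.

```python
from collections import Counter
from typing import Dict, List, Set


def schedule(task_chunks: Dict[str, List[str]],
             chunk_locations: Dict[str, Set[str]],
             slots_per_node: Dict[str, int]) -> Dict[str, str]:
    """Greedy locality-aware placement (illustrative assumption, not MRAP's policy)."""
    free = dict(slots_per_node)          # remaining task slots per node
    assignment: Dict[str, str] = {}

    for task, chunks in task_chunks.items():
        # Count how many of this task's chunks each node already holds.
        local = Counter()
        for chunk in chunks:
            for node in chunk_locations.get(chunk, ()):
                local[node] += 1

        candidates = [node for node, slots in free.items() if slots > 0]
        if not candidates:
            raise RuntimeError("no free slots left for task " + task)

        # Prefer the node with the most local chunks; break ties by node name.
        best = min(candidates, key=lambda node: (-local[node], node))
        assignment[task] = best
        free[best] -= 1

    return assignment


placement = schedule(
    task_chunks={"t1": ["c1", "c2"], "t2": ["c3"], "t3": ["c1"]},
    chunk_locations={"c1": {"n1"}, "c2": {"n1", "n2"}, "c3": {"n2"}},
    slots_per_node={"n1": 1, "n2": 2},
)
print(placement)  # {'t1': 'n1', 't2': 'n2', 't3': 'n2'}
```

In the example, `t3` would prefer `n1` but that node's only slot is taken, so it falls back to `n2` and reads its chunk remotely. A real scheduler would also have to weigh chunk sizes, replication, and load, which this sketch ignores.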
