Conference paper · IEEE International Conference on Distributed Computing Systems

Will They Blend?: Exploring Big Data Computation Atop Traditional HPC NAS Storage



Abstract

The Apache Hadoop framework has ushered in a new era in how data-rich organizations can process, store, and analyze large amounts of data. This has resulted in increased potential for an infrastructure exodus from the traditional solution of commercial database ad-hoc analytics on network-attached storage (NAS). While many data-rich organizations can afford either to move entirely to Hadoop for their Big Data analytics, or to maintain their existing traditional infrastructures and acquire a new set of infrastructure solely for Hadoop jobs, most supercomputing centers enjoy neither of those possibilities. Too much of the existing scientific code is tailored to work on massively parallel file systems unlike the Hadoop Distributed File System (HDFS), and their datasets are too large to reasonably maintain and/or ferry between two distinct storage systems. Nevertheless, as scientists search for easier-to-program frameworks with a lower time-to-science to post-process their huge datasets after execution, there is increasing pressure to enable use of MapReduce within these traditional High Performance Computing (HPC) architectures. Therefore, in this work we explore potential means of enabling use of the easy-to-program Hadoop MapReduce framework without requiring a complete infrastructure overhaul of existing HPC NAS solutions. We demonstrate that retaining function-dedicated resources like NAS is not only possible, but can even be done efficiently with MapReduce. In our exploration, we unearth subtle pitfalls resulting from this mash-up of new-era Big Data computation on conventional HPC storage and share the architectural configurations that allow us to avoid them. Finally, we design and present a novel Hadoop file system, the Reliable Array of Independent NAS File System (RainFS), and experimentally demonstrate its improvements in performance and reliability over the previous architectures we have investigated.
