CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows

SHAOHUA DUAN; PRADEEP SUBEDI; PHILIP DAVIS; KEITA TERANISHI; HEMANTH KOLLA; MARC GAMELL; MANISH PARASHAR

Abstract

The dramatic increase in the scale of current and planned high-end HPC systems is leading to new challenges, such as the growing costs of data movement and I/O and the reduced mean time between failures (MTBF) of system components. In-situ workflows, i.e., executing entire application workflows on the HPC system, have emerged as an attractive approach to address data-related challenges by moving computations closer to the data, and staging-based frameworks have been used effectively to support in-situ workflows at scale. However, the resilience of these staging-based solutions has not been addressed, and they remain susceptible to expensive data failures. Furthermore, naive use of data resilience techniques such as n-way replication and erasure codes can impact latency and/or result in significant storage overheads. In this article, we present CoREC, a scalable and resilient in-memory data staging runtime for large-scale in-situ workflows. CoREC uses a novel hybrid approach that combines dynamic replication with erasure coding based on data access patterns. It also leverages multiple levels of replication and erasure coding to support diverse data resiliency requirements. Furthermore, the article presents optimizations for load balancing and conflict-avoiding encoding, and a low-overhead, lazy data recovery scheme. We have implemented the CoREC runtime, deployed it with the DataSpaces staging service on leadership-class computing machines, and present an experimental evaluation in the article. The experiments demonstrate that CoREC can tolerate in-memory data failures while maintaining low latency and sustaining high overall storage efficiency at large scales.
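
The storage trade-off the abstract alludes to can be made concrete with a little arithmetic. The sketch below is illustrative only and is not CoREC's implementation: the function names, the `DataObject` type, the access counter, and the `hot_threshold` value are all hypothetical. It assumes a maximum-distance-separable (k, m) erasure code such as Reed-Solomon, which tolerates the loss of any m fragments, and a toy policy that replicates frequently accessed objects and erasure-codes the rest.

```python
from dataclasses import dataclass


def replication_overhead(n_copies: int) -> float:
    """Extra storage per byte of data when n_copies full copies are kept."""
    return float(n_copies - 1)


def erasure_overhead(k: int, m: int) -> float:
    """Extra storage per byte for a (k, m) code: k data plus m parity fragments."""
    return m / k


@dataclass
class DataObject:
    name: str
    accesses_in_window: int  # hypothetical counter of recent accesses


def choose_scheme(obj: DataObject, hot_threshold: int = 10) -> str:
    """Toy policy (an assumption, not the paper's algorithm):
    replicate hot objects to avoid encoding cost on frequent updates,
    erasure-code cold objects to save memory."""
    if obj.accesses_in_window >= hot_threshold:
        return "replicate"
    return "erasure_code"


# Both configurations below tolerate two simultaneous failures.
three_way = replication_overhead(3)  # 2.0, i.e. 200% extra storage
rs_4_2 = erasure_overhead(4, 2)      # 0.5, i.e. 50% extra storage
```

Under these assumptions, 3-way replication and a (4, 2) code survive the same number of failures, but replication stores four times as much redundant data, while the erasure code pays an encoding cost on every write or update. A hybrid scheme tries to pay each cost only where it is cheapest.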

Bibliographic details

Source: ACM Transactions on Parallel Computing, 2020, Issue 2, pp. 12.1-12.29 (29 pages)
Authors: SHAOHUA DUAN; PRADEEP SUBEDI; PHILIP DAVIS; KEITA TERANISHI; HEMANTH KOLLA; MARC GAMELL; MANISH PARASHAR
Author affiliations:

Rutgers Discovery Informatics Institute, Rutgers University, USA;
Rutgers Discovery Informatics Institute, Rutgers University, USA;
Rutgers Discovery Informatics Institute, Rutgers University, USA;
Sandia National Laboratories, USA;
Sandia National Laboratories, USA;
Intel, USA;
Rutgers Discovery Informatics Institute, Rutgers University, USA
Original format: PDF
Language: English
Keywords: data resilience; erasure codes; replication; in-situ workflows; data staging

Similar literature

1. In-memory staging and data-centric task placement for coupled scientific simulation workflows [J]. Fan Zhang, Tong Jin, Qian Sun. Concurrency and Computation: Practice and Experience. 2017, Issue 12
2. Distributed in-memory data management for workflow executions [J]. Renan Souza, Vitor Silva, Alexandre A. B. Lima. PeerJ Computer Science. 2021, Issue a
3. Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework [J]. Tanveer Ahmad, Nauman Ahmed, Zaid Al-Ars. BMC Genomics. 2020, Issue S10
4. Stacker: An Autonomic Data Movement Engine for Extreme-Scale Data Staging-Based In-Situ Workflows [C]. Pradeep Subedi, Philip Davis, Shaohua Duan. International Conference for High Performance Computing, Networking, Storage and Analysis. 2018
5. A Shared-Memory Coupled Architecture to Leverage Big Data Frameworks in Prototyping and In-situ Analytics for Data Intensive Scientific Workflows [D]. Lemon, Alexander Michael. 2019
6. Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework [O]. Tanveer Ahmad, Nauman Ahmed, Zaid Al-Ars. 2020
7. Scalable and Resilient Workflow Executions on Production Distributed Computing Infrastructures [O]. Rojas Balderrama, Javier; Truong Huu, Tram; Montagnat, Johan. 2012
