Conference: IEEE International Symposium on Parallel & Distributed Processing

Scalable failure recovery for high-performance data aggregation

Abstract

Many high-performance tools, applications and infrastructures, such as Paradyn, STAT, TAU, Ganglia, SuperMon, Astrolabe, Borealis, and MRNet, use data aggregation to synthesize large data sets and reduce data volumes while retaining relevant information content. Hierarchical or tree-based overlay networks (TBONs) are often used to execute data aggregation operations in a scalable, piecewise fashion. In this paper, we present state compensation, a scalable failure recovery model for high-bandwidth, low-latency TBON computations. By leveraging inherently redundant state information found in many TBON computations, state compensation avoids explicit state replication (for example, process checkpoints and message logging) and incurs no overhead in the absence of failures. Further, when failures do occur, state compensation uses a weak data consistency model and localized protocols that allow processes to recover from failures independently and responsively. Based on a formal specification of our data aggregation model, we have validated state compensation and identified its assumptions and limitations: state compensation requires that data aggregation operations be associative, commutative and idempotent. In this paper, we describe the fundamental state compensation concepts and a prototype implementation integrated into the MRNet TBON infrastructure. Our experiments with this framework suggest that for TBONs supporting up to millions of application processes, state compensation can yield millisecond recovery latencies and inconsequential application perturbation.
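
The requirement that aggregation operations be associative, commutative and idempotent can be made concrete with a small sketch. The code below is not the MRNet API and not the paper's recovery protocol; the tree layout, the node names, and the helpers `aggregate` and `adopt_orphans` are hypothetical, chosen only to illustrate the idea. It assumes a failed internal node's children are adopted by its parent and then resend their state. Under those assumptions, set union tolerates both the changed tree shape and the duplicated input, while plain addition does not.

```python
from functools import reduce


def aggregate(op, node, children, leaf_state):
    """Recursively combine the leaf states below `node` with the binary `op`."""
    kids = children.get(node, [])
    if not kids:
        return leaf_state[node]
    return reduce(op, (aggregate(op, k, children, leaf_state) for k in kids))


def adopt_orphans(children, failed):
    """Return a new tree in which the parent of `failed` adopts its children.

    This reconnection rule is an assumption made for illustration only.
    """
    parent = next(p for p, kids in children.items() if failed in kids)
    new = {p: list(kids) for p, kids in children.items() if p != failed}
    new[parent] = [k for k in new[parent] if k != failed] + children.get(failed, [])
    return new


def union(a, b):
    return a | b  # associative, commutative and idempotent


def add(a, b):
    return a + b  # associative and commutative, but not idempotent


# Hypothetical tree: root -> {a, b}, a -> {l1, l2}, b -> {l3, l4}
children = {"root": ["a", "b"], "a": ["l1", "l2"], "b": ["l3", "l4"]}
leaf_state = {"l1": {1, 2}, "l2": {2, 3}, "l3": {4}, "l4": {5}}

# The result is the same before and after internal node "a" fails,
# because regrouping and reordering do not affect a union.
before = aggregate(union, "root", children, leaf_state)
after = aggregate(union, "root", adopt_orphans(children, "a"), leaf_state)

# The orphans resend state that the root has already folded in.
# Idempotence makes the duplicate input harmless.
recovered = reduce(union, [before, leaf_state["l1"], leaf_state["l2"]])

# The same resend with addition double-counts the orphans' contribution.
counts = {leaf: len(state) for leaf, state in leaf_state.items()}
correct_total = aggregate(add, "root", children, counts)
double_counted = reduce(add, [correct_total, counts["l1"], counts["l2"]])
```

Here `before`, `after` and `recovered` all hold the same set, whereas `double_counted` exceeds `correct_total`. This is one way to see why an operation such as a raw sum falls outside the stated assumptions unless it is reformulated.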
