Published in: 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)

Scalable failure recovery for high-performance data aggregation

Abstract

Many high-performance tools, applications and infrastructures, such as Paradyn, STAT, TAU, Ganglia, SuperMon, Astrolabe, Borealis, and MRNet, use data aggregation to synthesize large data sets and reduce data volumes while retaining relevant information content. Hierarchical or tree-based overlay networks (TBONs) are often used to execute data aggregation operations in a scalable, piecewise fashion. In this paper, we present state compensation, a scalable failure recovery model for high-bandwidth, low-latency TBON computations. By leveraging inherently redundant state information found in many TBON computations, state compensation avoids explicit state replication (for example, process checkpoints and message logging) and incurs no overhead in the absence of failures. Further, when failures do occur, state compensation uses a weak data consistency model and localized protocols that allow processes to recover from failures independently and responsively. Based on a formal specification of our data aggregation model, we have validated state compensation and identified its assumptions and limitations: state compensation requires that data aggregation operations be associative, commutative and idempotent. In this paper, we describe the fundamental state compensation concepts and a prototype implementation integrated into the MRNet TBON infrastructure. Our experiments with this framework suggest that for TBONs supporting up to millions of application processes, state compensation can yield millisecond recovery latencies and inconsequential application perturbation.
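The requirement that aggregation operations be associative, commutative and idempotent can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the paper's recovery protocol and not the MRNet API: the two-level tree, the failure scenario, and the names `set_union`, `run_without_failure` and `run_with_recovery` are all hypothetical. It assumes an internal node fails after forwarding a partial result, and that its orphaned children then resend their state directly to the root, so one child's contribution arrives twice.

```python
from functools import reduce
from operator import add


def set_union(x, y):
    """Associative, commutative, and idempotent: set_union(x, x) == x."""
    return x | y


def run_without_failure(op, leaves):
    """Two internal nodes each aggregate half the leaves; the root merges them."""
    mid = len(leaves) // 2
    left = reduce(op, leaves[:mid])
    right = reduce(op, leaves[mid:])
    return op(left, right)


def run_with_recovery(op, leaves):
    """The right internal node fails after forwarding a partial result.

    Assumed scenario (hypothetical): the partial result already covers the
    first leaf of the right subtree. The orphaned leaves then reconnect to
    the root and resend their state, so that leaf is aggregated twice.
    """
    mid = len(leaves) // 2
    left = reduce(op, leaves[:mid])
    partial_right = leaves[mid]   # forwarded before the failure
    resent = leaves[mid:]         # resent by the orphans after adoption
    return reduce(op, [left, partial_right, *resent])


# Illustrative data only.
leaf_sets = [{"a"}, {"b"}, {"a", "c"}, {"d"}]
leaf_counts = [1, 2, 3, 4]

# Set union tolerates the duplicated input; integer addition does not.
union_is_stable = (
    run_with_recovery(set_union, leaf_sets)
    == run_without_failure(set_union, leaf_sets)
)
sum_is_stable = (
    run_with_recovery(add, leaf_counts)
    == run_without_failure(add, leaf_counts)
)
print(union_is_stable, sum_is_stable)
```

In this sketch the union result is unchanged by the duplicated input, while the sum is inflated because addition is not idempotent. That contrast is the intuition behind the stated limitation: redundant state can only stand in for explicit replication when re-aggregating it is harmless.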