Conference: Design, Automation & Test in Europe Conference and Exhibition

An energy-aware fault tolerant scheduling framework for soft error resilient cloud computing systems

Abstract

For modern high-performance systems, aggressive technology and voltage scaling have drastically increased susceptibility to soft errors. At the scale of cloud computing, soft-error-induced failures will clearly occur far more frequently, but it is unclear how to apply current error detection and fault tolerance techniques effectively at scale. In this paper, we focus on energy-aware fault-tolerant scheduling in public, multi-user cloud systems and explore the three-way tradeoff between reliability (in terms of soft error resiliency), performance, and energy. Through a systematically optimized process of resource allocation, error detection approach selection, virtual machine placement, spatial/temporal redundancy augmentation, and task scheduling, the cloud service provider can achieve high error coverage and fault tolerance confidence while minimizing global energy costs under user deadline constraints. Our scheduling algorithm includes a static scheduling phase that operates on task-graph-based workload inputs prior to execution, and a lightweight dynamic scheduler that migrates tasks during execution in case of excessive re-executions. All schedules are evaluated on a runtime simulation engine that (1) mimics the performance fluctuations in cloud systems and (2) supports the injection of arbitrary fault patterns. Compared to current virtual machine or task replication techniques, we are able to reduce overall application failure rates by over 50% with approximately 76% total energy overhead.
