Conference proceedings: Parallel and Distributed Processing and Applications; Lecture Notes in Computer Science; 4330

Framework for Enabling Highly Available Distributed Applications for Utility Computing

Abstract

The move towards IT outsourcing is the first step towards an environment where compute infrastructure is treated as a service. In utility computing, this IT service has to honor Service Level Agreements (SLAs) in order to meet the desired Quality of Service (QoS) guarantees. Such an environment requires reliable services in order to maximize the utilization of resources and to decrease the Total Cost of Ownership (TCO). Such reliability cannot come at the cost of resource duplication, since duplication increases the TCO of the data center and hence the cost per compute unit. In this paper, we look into projecting the impact of hardware failures on SLAs and the techniques required to take proactive recovery steps when a failure is predicted. By maintaining health vectors of all hardware and system resources, we predict the failure probability of resources at runtime, based on observed hardware errors and failure events. This in turn drives an availability-aware middleware to take proactive action (even before the application is affected, in case the system and the application have low recoverability). The proposed framework has been prototyped on a system running HP-UX. Our offline analysis of the prediction system on hardware error logs indicates no more than 10% false positives. To the best of our knowledge, this work is the first of its kind to perform an end-to-end analysis of the impact of a hardware fault on application SLAs in a live system.
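
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea: keep a per-resource health vector of observed error events, turn it into a failure probability, and let a middleware decide whether to act proactively. Every name here (`HealthVector`, `ERROR_WEIGHTS`, `failure_probability`, `choose_action`), the error types, the weights, the `1 - exp(-score)` mapping, and the threshold are assumptions made for illustration and are not taken from the paper.

```python
import math
from dataclasses import dataclass, field


@dataclass
class HealthVector:
    """Hypothetical per-resource record of observed hardware error counts."""
    resource: str
    error_counts: dict = field(default_factory=dict)

    def record(self, error_type: str, count: int = 1) -> None:
        # Accumulate observed error events by type.
        self.error_counts[error_type] = self.error_counts.get(error_type, 0) + count


# Illustrative weights only (assumption); a real system would fit these to error logs.
ERROR_WEIGHTS = {"correctable_ecc": 0.05, "uncorrectable_ecc": 1.5, "bus_timeout": 0.4}


def failure_probability(hv: HealthVector) -> float:
    """Map the weighted error score to [0, 1) using 1 - exp(-score) (assumed model)."""
    score = sum(ERROR_WEIGHTS.get(kind, 0.0) * n for kind, n in hv.error_counts.items())
    return 1.0 - math.exp(-score)


def choose_action(p_fail: float, recoverability: float, threshold: float = 0.5) -> str:
    """Act proactively when predicted risk is high and recoverability is low."""
    # recoverability is assumed to lie in [0, 1]; 1 means the application recovers easily.
    risk = p_fail * (1.0 - recoverability)
    return "migrate" if risk >= threshold else "monitor"
```

In this sketch, a resource with several uncorrectable errors hosting an application with low recoverability would be flagged for migration, while the same resource hosting an easily recoverable application would only be monitored.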
