Published in: 2012 8th International Conference on Network and Service Management

Failure analysis of distributed scientific workflows executing in the cloud


Abstract

This work presents models that characterize failures observed while executing large scientific applications on Amazon EC2. Scientific workflows serve as the underlying abstraction for representing these applications. As workflows scale to hundreds of thousands of distinct tasks, failures caused by software and hardware faults become increasingly common. We study job failure models built from data that our Stampede framework collected from four scientific applications. In particular, we show that a Naive Bayes classifier can accurately predict the failure probability of jobs. The models let us predict job failures for a given execution resource and then use these predictions for two higher-level goals: (1) suggesting a better job assignment, and (2) giving workflow component developers quantitative feedback on the robustness of their application code.
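
The abstract does not say which features or implementation the authors used, so the following is only a minimal sketch of the general idea: a categorical Naive Bayes classifier with add-one smoothing that estimates a job's failure probability, then picks the execution resource with the lowest estimate. The feature names (`host`, `task`), the host names, the job history, and the helper `safest_host` are all hypothetical and are not taken from the Stampede data.

```python
from collections import Counter, defaultdict


class CategoricalNaiveBayes:
    """Naive Bayes over categorical features with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.total = len(labels)
        # value_counts[(label, feature)][value] = number of training jobs
        self.value_counts = defaultdict(Counter)
        # vocab[feature] = set of values seen for that feature
        self.vocab = defaultdict(set)
        for row, label in zip(rows, labels):
            for name, value in row.items():
                self.value_counts[(label, name)][value] += 1
                self.vocab[name].add(value)
        return self

    def predict_proba(self, row):
        scores = {}
        for label in self.classes:
            # Prior P(label) times the product of smoothed P(value | label).
            score = self.class_counts[label] / self.total
            for name, value in row.items():
                seen = self.value_counts[(label, name)][value]
                score *= (seen + 1) / (self.class_counts[label] + len(self.vocab[name]))
            scores[label] = score
        norm = sum(scores.values())
        return {label: s / norm for label, s in scores.items()}


def failure_probability(model, job):
    """Predicted probability that the job ends in the 'fail' class."""
    return model.predict_proba(job)["fail"]


def safest_host(model, task, hosts):
    """Hypothetical helper: choose the host with the lowest predicted failure."""
    return min(hosts, key=lambda h: failure_probability(model, {"host": h, "task": task}))


# Hypothetical job history: (features, outcome). Not real measurements.
history = [
    ({"host": "ec2-a", "task": "align"}, "fail"),
    ({"host": "ec2-a", "task": "align"}, "fail"),
    ({"host": "ec2-a", "task": "merge"}, "success"),
    ({"host": "ec2-a", "task": "merge"}, "fail"),
    ({"host": "ec2-b", "task": "align"}, "success"),
    ({"host": "ec2-b", "task": "align"}, "success"),
    ({"host": "ec2-b", "task": "merge"}, "success"),
    ({"host": "ec2-b", "task": "align"}, "fail"),
]

model = CategoricalNaiveBayes().fit(
    [features for features, _ in history],
    [outcome for _, outcome in history],
)

if __name__ == "__main__":
    for host in ("ec2-a", "ec2-b"):
        p = failure_probability(model, {"host": host, "task": "align"})
        print(f"{host}: P(fail) = {p:.3f}")
    print("suggested host:", safest_host(model, "align", ["ec2-a", "ec2-b"]))
```

The same per-job estimates could, under these assumptions, be aggregated by `task` to give a component developer a rough robustness figure for each piece of application code.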
