Journal: Future Generation Computer Systems

Self-healing of workflow activity incidents on distributed computing infrastructures

Abstract

Distributed computing infrastructures are commonly used through scientific gateways, but operating these gateways requires substantial human intervention to handle operational incidents. This paper presents a self-healing process that quantifies the incident degrees of workflow activities from metrics measuring the long-tail effect, application efficiency, data transfer issues, and site-specific problems. These metrics are simple enough to be computed online, and they make few assumptions about application or resource characteristics. Based on their degree, incidents are classified into levels and associated with sets of healing actions, which are selected using association rules that model correlations between incident levels. We specifically study the long-tail effect and propose a new algorithm to control task replication. The healing process is parametrized on real application traces acquired in production on the European Grid Infrastructure. Experimental results obtained in the Virtual Imaging Platform show that the proposed method speeds up execution by up to a factor of 4, consumes up to 26% less resource time than a control execution, and properly detects unrecoverable errors.
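
The pipeline outlined in the abstract (metric, incident degree, incident level, healing action) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the long-tail metric (share of running tasks slower than the median completed task), the thresholds in `LEVEL_THRESHOLDS`, the action names, and the fixed lookup table `HEALING_ACTIONS`, which stands in for the association rules the authors use to select actions.

```python
from bisect import bisect_right
from statistics import median

# Hypothetical thresholds splitting a degree in [0, 1] into levels 1, 2, 3.
LEVEL_THRESHOLDS = (0.3, 0.7)

# Hypothetical (metric, level) -> action table. The paper selects actions
# with association rules; a static table is used here only for illustration.
HEALING_ACTIONS = {
    ("long_tail", 2): "replicate_slow_tasks",
    ("long_tail", 3): "replicate_slow_tasks",
    ("site", 3): "blacklist_site",
}


def long_tail_degree(running_elapsed, completed_durations):
    """Illustrative degree: fraction of running tasks whose elapsed time
    already exceeds the median duration of completed tasks."""
    if not running_elapsed or not completed_durations:
        return 0.0
    reference = median(completed_durations)
    late = sum(1 for elapsed in running_elapsed if elapsed > reference)
    return late / len(running_elapsed)


def incident_level(degree):
    """Map a degree in [0, 1] to a discrete level starting at 1."""
    return bisect_right(LEVEL_THRESHOLDS, degree) + 1


def select_actions(degrees):
    """Return the healing actions triggered by a dict of metric degrees."""
    actions = []
    for metric, degree in degrees.items():
        action = HEALING_ACTIONS.get((metric, incident_level(degree)))
        if action and action not in actions:
            actions.append(action)
    return actions


if __name__ == "__main__":
    degree = long_tail_degree([100, 500, 900], [200, 300, 400])
    print(incident_level(degree), select_actions({"long_tail": degree, "site": 0.9}))
```

Because each degree is computed from counters that a workflow engine already tracks, a loop of this shape could run online at every task event, which is the property the abstract emphasizes.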