Conference: IEEE International Conference on Big Data

Deep Learning for Enhancing Fault Tolerant Capabilities of Scientific Workflows

Abstract

In the history of computer science, 'delegation' has been the greatest multiplier of society's problem-solving ability. A scientist working on detecting anomalies in a phenomenon does not need to reinvent matrix multiplication techniques to solve her problem. Scientific workflows provide the ultimate 'delegation' mechanism: a domain scientist can completely forget the specifics of 'how' her program will execute on a large cluster in an efficient and cost-effective manner, and can instead focus on the mathematical formulation and theoretical robustness of her solution. We present an approach that directly aims to make the execution of scientific workflows more reliable, robust, and efficient. We hope that the work presented in this paper will advance the larger effort within the scientific workflow community to make scientific workflow execution as simple, efficient, and robust as a JOIN operation in a modern database. Specifically, we apply deep learning techniques to develop a mechanism that forecasts the final state (success or failure) of a dynamic job in a large-scale particle physics experiment, with minimal data gathering and as early as possible in the job's life cycle. The key advantage of a predictive mechanism that identifies and anticipates failure-prone jobs is the potential for designing intelligent fault tolerance mechanisms to handle anomalous events. We achieve a 14% improvement in computational resource utilization and an overall classification accuracy of 85% on real tasks executed in a High Energy Physics computing workflow. To the best of our knowledge, this is the most exhaustive, and the first of its kind, study of neural network architectures in the context of a real dataset profiled from a large-scale scientific workflow.
