IEEE/ACM International Conference on Grid Computing

Troubleshooting Thousands of Jobs on Production Grids Using Data Mining Techniques

Abstract

Large-scale production computing grids introduce new challenges in debugging and troubleshooting. A user who submits a workload consisting of tens of thousands of jobs to a grid of thousands of processors has a good chance of receiving thousands of error messages as a result. How can one begin to reason about such problems? We propose that data mining techniques can be employed to classify failures according to the properties of the jobs and machines involved. We demonstrate this technique through several case studies on real workloads consisting of tens of thousands of jobs. We apply the same techniques to a year's worth of data from a 3000-CPU production grid and use them to gain a high-level understanding of the system's behavior.
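The abstract does not say which data mining algorithm is used, so the following is only a minimal sketch of the general idea: given job records labeled with an outcome, find the job or machine property that best separates failures from successes. It assumes an information-gain ranking (the criterion used in simple decision-tree learners). The field names `machine_os`, `job_memory`, and `status`, and the sample records, are hypothetical and not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(records, attribute, label="status"):
    """Reduction in label entropy obtained by splitting records on one attribute."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[attribute]].append(rec[label])
    # Weighted average entropy of the groups produced by the split.
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy([rec[label] for rec in records]) - remainder


def best_attribute(records, attributes, label="status"):
    """Return the attribute whose values best explain the job outcomes."""
    return max(attributes, key=lambda a: information_gain(records, a, label))


# Hypothetical job records (illustrative only, not data from the paper).
JOBS = [
    {"machine_os": "linux", "job_memory": "high", "status": "ok"},
    {"machine_os": "linux", "job_memory": "low", "status": "ok"},
    {"machine_os": "linux", "job_memory": "high", "status": "ok"},
    {"machine_os": "solaris", "job_memory": "low", "status": "failed"},
    {"machine_os": "solaris", "job_memory": "high", "status": "failed"},
    {"machine_os": "linux", "job_memory": "low", "status": "ok"},
]

if __name__ == "__main__":
    for attr in ("machine_os", "job_memory"):
        print(attr, round(information_gain(JOBS, attr), 3))
    print("best:", best_attribute(JOBS, ["machine_os", "job_memory"]))
```

In this made-up data every failure occurs on one operating system, so `machine_os` is ranked above `job_memory`. On a real workload the same ranking would be a starting point for investigation rather than a diagnosis.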