International Conference on High Performance Computing

Anomaly detection in large-scale coalition clusters for dependability assurance


Abstract

In large-scale high-performance computing systems, component failures are the norm rather than the exception. The occurrence of failures and their impact on system performance and operating costs are an increasingly important concern for system designers and administrators. When a compute node fails to function properly, health-related data are valuable for troubleshooting. However, it is challenging to identify anomalies effectively in large volumes of noisy, high-dimensional data. Manual detection is time-consuming and error-prone, and it does not scale well. In this paper, we present an autonomic mechanism for anomaly detection in coalition clusters. It is composed of a set of techniques that facilitate automatic analysis of system health data. We apply data transformation to format health data in a uniform manner. Principal variables are then chosen by feature selection, which reduces the data size. Clustering and outlier detection are explored to identify nodes with anomalous behavior. We evaluate our prototype implementation on a production institution-wide computational grid. The results show that our mechanism can effectively detect faulty nodes with high accuracy and low computation overhead.
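The abstract names the pipeline stages (data transformation, feature selection, clustering, outlier detection) but not the specific algorithms, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the health data is already a numeric matrix with one row per node and one column per metric. It uses z-score standardization for the transformation step, swaps in a PCA projection for the feature selection step, and scores outliers by a robust (median/MAD) z-score of each node's distance from the median. The clustering step is omitted: all nodes are treated as a single group. All function names, the `n_components` parameter, and the `threshold` value of 3.5 are hypothetical.

```python
import numpy as np


def standardize(X):
    """Scale every metric (column) to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # constant metrics carry no signal; avoid 0/0
    return (X - mu) / sigma


def project_to_principal_components(Z, n_components):
    """Project standardized (column-centered) data onto its leading principal components."""
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_components].T


def outlier_scores(P):
    """Robust z-score of each node's distance from the coordinate-wise median."""
    dist = np.linalg.norm(P - np.median(P, axis=0), axis=1)
    med = np.median(dist)
    mad = np.median(np.abs(dist - med))
    if mad == 0:
        mad = 1.0  # degenerate case: at least half the distances are identical
    # 0.6745 rescales the MAD so scores are comparable to standard z-scores.
    return 0.6745 * (dist - med) / mad


def detect_anomalous_nodes(X, n_components=2, threshold=3.5):
    """Return (indices of flagged nodes, per-node scores). Assumed pipeline."""
    Z = standardize(X)
    P = project_to_principal_components(Z, n_components)
    scores = outlier_scores(P)
    return np.flatnonzero(scores > threshold), scores


if __name__ == "__main__":
    # Synthetic stand-in for health data: 50 nodes, 6 metrics, one faulty node.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 6))
    X[7] += 15.0  # node 7 deviates strongly on every metric
    flagged, scores = detect_anomalous_nodes(X)
    print("flagged nodes:", flagged)
```

The median and MAD are used instead of the mean and standard deviation so that the faulty nodes themselves do not inflate the baseline they are compared against. With real data, a clustering step could be run first and the same scoring applied within each cluster.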
