Journal: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Robust Identification of Thermal Models for In-Production High-Performance-Computing Clusters With Machine Learning-Based Data Selection

Abstract

Power and thermal management are critical components of high-performance-computing (HPC) systems because of their high power density and large total power consumption. Assessing thermal dissipation with compact models learned directly from the thermal response of the final device enables more robust and precise thermal control strategies, as well as automated diagnosis. However, for large-scale systems "in production," the accuracy of learned thermal models depends on the dynamics of the power excitation, which in turn depends on the executed workload, and on measurement nonidealities such as quantization. In this article we show that, using an advanced system identification algorithm, we can generate very accurate thermal models (average error lower than our sensors' quantization step of 1 °C) for a large-scale HPC system running real workloads over very long time periods. However, we also show that: 1) not all real workloads allow a good model to be identified and 2) starting from the theory of system identification, it is very difficult to evaluate whether a data trace leads to a good estimated model. We then propose and validate a set of techniques based on machine learning and deep learning algorithms for choosing the data traces to be used for model identification. We also show that deep learning techniques are absolutely necessary to correctly choose such traces up to 96% of the time.
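To make the dependence on power excitation concrete, the following is a minimal sketch and not the article's method. It assumes a hypothetical first-order discrete thermal model T[k+1] = a*T[k] + b*P[k] + c with made-up coefficients, emulates a 1 °C sensor quantization step by rounding, and uses plain least squares in place of the "advanced system identification algorithm" mentioned in the abstract. The condition number of the regressor matrix is used only as a crude stand-in for judging whether a trace is informative; the article proposes machine learning and deep learning techniques for that choice.

```python
import numpy as np

def simulate(power, a=0.95, b=0.01, c=1.25, t0=55.0):
    """Hypothetical first-order thermal model: T[k+1] = a*T[k] + b*P[k] + c."""
    temp = np.empty(len(power))
    temp[0] = t0
    for k in range(len(power) - 1):
        temp[k + 1] = a * temp[k] + b * power[k] + c
    return temp

def identify(power, temp_meas):
    """Least-squares fit of (a, b, c); also returns the regressor condition number."""
    phi = np.column_stack([temp_meas[:-1], power[:-1], np.ones(len(power) - 1)])
    theta, *_ = np.linalg.lstsq(phi, temp_meas[1:], rcond=None)
    return theta, np.linalg.cond(phi)

rng = np.random.default_rng(0)

# "Rich" trace: power steps to a new random level every 40 samples.
rich_power = np.repeat(rng.uniform(50.0, 250.0, size=100), 40)
# "Flat" trace: constant power, so the temperature never moves.
flat_power = np.full(4000, 150.0)

true_rich = simulate(rich_power)
meas_rich = np.round(true_rich)  # emulate a 1 degree C quantization step

(a_hat, b_hat, c_hat), cond_rich = identify(rich_power, meas_rich)

# Free-run the identified model and compare against the unquantized temperature.
pred_rich = simulate(rich_power, a_hat, b_hat, c_hat, t0=meas_rich[0])
mae_rich = np.mean(np.abs(pred_rich - true_rich))

# The flat trace yields a (numerically) rank-deficient regression problem.
_, cond_flat = identify(flat_power, np.round(simulate(flat_power)))

print(f"identified a = {a_hat:.4f}, mean abs error = {mae_rich:.3f} degC")
print(f"condition number: rich = {cond_rich:.3g}, flat = {cond_flat:.3g}")
```

In this toy setting the flat trace cannot separate the power gain from the constant offset, which is one simple way a workload can fail to support identification. Real traces are rarely this clear-cut, which is consistent with the abstract's point that classical criteria alone make trace quality hard to judge.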
