Journal: Sustainable Computing

PowerCoord: Power capping coordination for multi-CPU/GPU servers using reinforcement learning

Abstract

Modern supercomputers and cloud providers rely on server nodes equipped with multiple CPU sockets and general-purpose GPUs (GPGPUs) to handle the high demand for intensive computation. These nodes consume far more power than commodity servers, and integrating them with the power capping systems used in modern clusters presents new challenges. In this paper, we propose a new power capping controller, PowerCoord, that is specifically designed for servers with multiple CPU and GPU sockets that run multiple jobs at a time. PowerCoord coordinates among the various power domains (e.g., CPU sockets and GPUs) inside a server node to meet target power caps while seeking to maximize throughput. Our approach also takes job deadlines and priorities into consideration. Because performance modeling for co-located jobs is error-prone, PowerCoord uses a learning method. PowerCoord has a number of heuristic policies for allocating power among the various CPUs and GPUs, and it uses reinforcement learning to select among these policies at runtime. Based on the observed state of the system, PowerCoord shifts the distribution of selected policies. We implement our power cap controller on a real multi-CPU/GPU server with low overhead, and we demonstrate that it is able to meet target power caps while maximizing throughput and balancing other demands such as priorities and deadlines. Our results show that PowerCoord improves server throughput by 18% on average compared with the case in which power is not coordinated among CPU/GPU domains. PowerCoord also improves server throughput by 11% on average compared with prior work that uses a heuristic approach to coordinate power among domains. © 2020 Elsevier Inc. All rights reserved.
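
The policy-selection idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the three heuristic policies, the state encoding, the reward signal, and the softmax-preference (gradient-bandit style) update below are all assumptions chosen for illustration, and every name is hypothetical.

```python
import math
import random
from collections import defaultdict

# Hypothetical heuristic policies. Each one splits a node-level power cap
# (watts) across power domains, given each domain's recent power demand.
def equal_share(cap, demand):
    return [cap / len(demand)] * len(demand)

def demand_proportional(cap, demand):
    total = sum(demand)
    if total == 0:
        return equal_share(cap, demand)
    return [cap * d / total for d in demand]

def favor_highest(cap, demand):
    # Satisfy the hungriest domain first, then split the remainder equally.
    top = max(range(len(demand)), key=demand.__getitem__)
    grant = min(demand[top], cap)
    rest = (cap - grant) / (len(demand) - 1) if len(demand) > 1 else 0.0
    return [grant if i == top else rest for i in range(len(demand))]

POLICIES = [equal_share, demand_proportional, favor_highest]

class PolicySelector:
    """Keeps one softmax distribution over policies per observed state."""

    def __init__(self, policies, step=0.1, seed=None):
        self.policies = policies
        self.step = step
        self.rng = random.Random(seed)
        self.prefs = defaultdict(lambda: [0.0] * len(policies))
        self.baseline = defaultdict(float)  # running average reward per state

    def distribution(self, state):
        prefs = self.prefs[state]
        m = max(prefs)  # subtract the max for numerical stability
        exps = [math.exp(p - m) for p in prefs]
        z = sum(exps)
        return [e / z for e in exps]

    def choose(self, state):
        probs = self.distribution(state)
        return self.rng.choices(range(len(self.policies)), weights=probs)[0]

    def update(self, state, chosen, reward):
        # Shift probability toward policies that beat the running average.
        probs = self.distribution(state)
        advantage = reward - self.baseline[state]
        prefs = self.prefs[state]
        for i in range(len(prefs)):
            indicator = 1.0 if i == chosen else 0.0
            prefs[i] += self.step * advantage * (indicator - probs[i])
        self.baseline[state] += self.step * (reward - self.baseline[state])

def control_step(selector, state, cap, demand, measure_reward):
    """One control interval: pick a policy, apply it, learn from the reward."""
    idx = selector.choose(state)
    allocation = selector.policies[idx](cap, demand)
    reward = measure_reward(allocation)  # e.g. normalized throughput (assumed)
    selector.update(state, idx, reward)
    return allocation
```

In this sketch, `state` is any hashable summary of the system (for example, a tuple of cap level and the busiest domain), and `measure_reward` stands in for whatever throughput, priority, or deadline signal a real controller would observe. Applying the returned allocation to hardware is deliberately left out.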
