Journal: Concurrency, Practice and Experience

Scalable reinforcement learning on Cray XC



Abstract

Recent advancements in deep learning have made reinforcement learning (RL) applicable to a much broader range of decision-making problems. However, the emergence of reinforcement learning workloads brings multiple challenges to system resource management. RL applications continuously train a deep learning or machine learning model while interacting with uncertain simulation models. This new generation of AI applications imposes significant demands on system resources such as memory, storage, network, and compute. In this paper, we describe a typical RL application workflow and introduce the Ray distributed execution framework developed at the UC Berkeley RISELab. Ray includes the RLlib library for executing distributed reinforcement learning applications. We describe a recipe for deploying the Ray execution framework on Cray XC systems and demonstrate scaling of RLlib algorithms across multiple nodes of the system. We also explore performance characteristics across multiple CPU and GPU node types.


