
Scale-Out Acceleration for Machine Learning

International Symposium on Microarchitecture

Abstract

The growing scale and complexity of Machine Learning (ML) algorithms have resulted in the prevalent use of distributed general-purpose systems. In a rather disjoint effort, the community has focused mostly on high-performance single-node accelerators for learning. This work bridges these two paradigms and offers CoSMIC, a full computing stack, comprising a language, compiler, system software, template architecture, and circuit generators, that enables programmable acceleration of learning at scale. CoSMIC enables programmers to exploit scale-out acceleration on FPGAs and Programmable ASICs (P-ASICs) from a high-level, mathematical Domain-Specific Language (DSL), without requiring them to delve into the onerous tasks of system software development or hardware design. CoSMIC achieves the three conflicting objectives of efficiency, automation, and programmability by integrating a novel multi-threaded template accelerator architecture with a cohesive stack that generates the hardware and software code from its high-level DSL. CoSMIC can accelerate a wide range of learning algorithms, namely those most commonly trained with parallel variants of gradient descent. The key is to distribute the partial gradient calculations of the learning algorithms across the accelerator-augmented nodes of the scale-out system. Additionally, CoSMIC leverages the parallelizability of these algorithms to offer multi-threaded acceleration within each node. Multi-threading allows CoSMIC to efficiently exploit the abundant resources of modern FPGAs/P-ASICs by striking a balance between multi-threaded parallelism and single-threaded performance. CoSMIC also takes advantage of algorithmic properties of ML to provide specialized system software that optimizes task allocation, role assignment, thread management, and internode communication. We evaluate the versatility and efficiency of CoSMIC on 10 machine learning applications from various domains. On average, a 16-node CoSMIC system with UltraScale+ FPGAs offers an 18.8× speedup over a 16-node Spark system with Xeon processors, while the programmer writes only 22-55 lines of code. CoSMIC also scales better than the state-of-the-art Spark: scaling from 4 to 16 nodes yields a 2.7× improvement with CoSMIC versus 1.8× with Spark. These results confirm that CoSMIC's full-stack approach takes an effective and vital step toward enabling scale-out acceleration for machine learning.
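
The core mechanism the abstract describes, distributing partial gradient calculations across the nodes of a scale-out system and aggregating them for the model update, can be illustrated with a short sketch. The NumPy code below is not from the paper: the function names (partial_gradient, distributed_gradient_descent) and the least-squares objective are illustrative assumptions, and plain Python stands in for the per-node work that CoSMIC would offload to an FPGA or P-ASIC.

```python
import numpy as np

def partial_gradient(w, X_shard, y_shard):
    """Least-squares gradient on one node's shard of the data.

    In CoSMIC, this per-node computation would run on the node's
    accelerator; here it is ordinary NumPy for illustration.
    """
    residual = X_shard @ w - y_shard
    return X_shard.T @ residual / len(y_shard)

def distributed_gradient_descent(X, y, num_nodes=4, lr=0.1, steps=100):
    # Shard the training data across the nodes of the scale-out system.
    X_shards = np.array_split(X, num_nodes)
    y_shards = np.array_split(y, num_nodes)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        # Each node computes its partial gradient independently (in
        # parallel on real hardware); the results are then averaged.
        # With equal-sized shards this average equals the full gradient.
        grads = [partial_gradient(w, Xs, ys)
                 for Xs, ys in zip(X_shards, y_shards)]
        w -= lr * np.mean(grads, axis=0)
    return w

# Hypothetical usage on synthetic data:
rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 8))
y = X @ rng.normal(size=8)
w = distributed_gradient_descent(X, y)
```

The sketch captures only the data-parallel pattern; the paper's contribution is the stack (DSL, compiler, system software, and template architecture) that generates and coordinates the accelerated version of such loops automatically.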
