...
Journal: Computer Architecture News

Optimizing CNNs on Multicores for Scalability, Performance and Goodput

Abstract

Convolutional Neural Networks (CNNs) are a class of Artificial Neural Networks (ANNs) that are highly efficient at the pattern recognition tasks underlying difficult AI problems in a variety of domains, such as speech recognition, object recognition, and natural language processing. CNNs are, however, computationally intensive to train. This paper presents the first characterization of the performance optimization opportunities for training CNNs on CPUs. Our characterization includes insights based on the structure of the network itself (i.e., the intrinsic arithmetic intensity of the convolution and its scalability under parallelism) as well as dynamic properties of its execution (i.e., the sparsity of the computation). Given this characterization, we present an automatic framework called spg-CNN for optimizing CNN training on CPUs. It comprises a computation scheduler for efficient parallel execution and two code generators: one that optimizes for sparsity, and another that optimizes for spatial reuse in convolutions. We evaluate spg-CNN using convolutions from a variety of real-world benchmarks, and show that spg-CNN can train CNNs faster than state-of-the-art approaches by an order of magnitude.
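The abstract's notion of intrinsic arithmetic intensity (floating-point operations per byte of memory traffic) can be illustrated with a back-of-the-envelope calculation for a single convolutional layer. The sketch below is not taken from the paper; the formula, function name, and parameters are illustrative assumptions for a stride-1, no-padding convolution, counting each multiply-accumulate as two operations and assuming each tensor is moved to or from memory exactly once.

```python
def conv_arithmetic_intensity(c_in, c_out, k, h, w, dtype_bytes=4):
    """Estimate FLOPs per byte for one stride-1 'valid' convolution layer.

    c_in, c_out : input / output channel counts
    k           : square kernel size (k x k)
    h, w        : input spatial dimensions
    dtype_bytes : bytes per element (4 for float32)
    """
    h_out, w_out = h - k + 1, w - k + 1  # output size with no padding, stride 1
    # Each output element needs c_in * k * k multiply-accumulates (2 ops each).
    flops = 2 * c_out * c_in * k * k * h_out * w_out
    # Assume input, weights, and output each cross the memory interface once.
    bytes_moved = dtype_bytes * (c_in * h * w
                                 + c_out * c_in * k * k
                                 + c_out * h_out * w_out)
    return flops / bytes_moved
```

For example, a 3-to-64-channel 3x3 convolution on a 32x32 input yields roughly 12 FLOPs/byte under these assumptions; layers with small channel counts sit much lower, which is one reason the paper characterizes intensity per convolution rather than assuming all layers are compute-bound.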