IEEE Transactions on Very Large Scale Integration (VLSI) Systems

MERIT: Tensor Transform for Memory-Efficient Vision Processing on Parallel Architectures

Abstract

Computationally intensive deep neural networks (DNNs) are well-suited to run on GPUs, but newly developed algorithms usually require heavily optimized DNN routines to work efficiently, and the problem is even harder for specialized DNN architectures. In this article, we propose a mathematical formulation that helps transfer algorithm-optimization knowledge across computing platforms. We observe that data movement and storage inside parallel processor architectures can be viewed as tensor transforms across memory hierarchies, which makes it possible to describe many memory-optimization techniques mathematically. This transform, which we call the memory-efficient ranged inner-product tensor (MERIT) transform, applies not only to DNN tasks but also to many traditional machine learning and computer vision computations. Moreover, the tensor transforms map readily onto existing vector processor architectures. We demonstrate that many popular applications can be expressed in a succinct MERIT notation on GPUs, speeding up GPU kernels by up to 20 times while using only half as many code tokens. We also use the principle of the proposed transform to design a specialized hardware unit, the MERIT-z processor, which can run a variety of DNN tasks as well as other computer vision tasks while providing area and power efficiency comparable to dedicated DNN application-specific integrated circuits (ASICs).
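The abstract's central idea, that data movement can be expressed as a tensor "view" transform followed by a plain inner product, has a familiar instance outside the paper: im2col-style convolution, where an overlapping-window view turns convolution into a single matrix multiply. The sketch below is an illustrative NumPy analogy, not the authors' MERIT implementation; the function name and shapes are our own.

```python
import numpy as np

def im2col_conv2d(x, w):
    """Convolve x (H, W) with kernel w (kh, kw) via a strided view + matmul.

    Illustrates the view-transform-then-inner-product pattern: the window
    view changes only strides (no data copy), and the contraction is one
    ordinary matrix-vector product.
    """
    kh, kw = w.shape
    H, W = x.shape
    oh, ow = H - kh + 1, W - kw + 1
    # Overlapping-window view of x, shape (oh, ow, kh, kw): a pure
    # index/stride transform, the "tensor transform" half of the pattern.
    view = np.lib.stride_tricks.sliding_window_view(x, (kh, kw))
    # Ranged inner product: contract the window axes against the kernel.
    return (view.reshape(oh * ow, kh * kw) @ w.reshape(-1)).reshape(oh, ow)

x = np.arange(16.0).reshape(4, 4)
w = np.ones((2, 2))          # box filter: each output is a 2x2 window sum
out = im2col_conv2d(x, w)    # shape (3, 3)
```

On a GPU the same separation matters because the view transform determines memory-access patterns (coalescing, shared-memory tiling) while the inner product is the compute kernel; describing the former mathematically is what lets one optimization carry across platforms.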
