Journal: Neural, Parallel & Scientific Computations

Codevelopment of Multi-Level Instruction Set Architecture and Hardware for an Efficient Matrix Processor

Abstract

The instruction set architecture (ISA) is the part of the processor that is visible to the programmer or compiler writer. A multi-level ISA is proposed to communicate data parallelism to the hardware explicitly and compactly, instead of relying on dynamic extraction by complex hardware or static extraction by sophisticated compiler techniques. This paper presents the codevelopment of a multi-level ISA and hardware for an efficient matrix processor called Mat-Core. Mat-Core extends a general-purpose scalar processor with a matrix unit for processing vector/matrix data. To hide memory latency, the matrix unit is decoupled into two components, address generation and data computation, which communicate through data queues. As in vector architectures, the data computation unit is organized in parallel lanes. However, on these lanes Mat-Core can execute scalar-matrix, vector-matrix, and matrix-matrix instructions in addition to scalar-vector and vector-vector instructions. Mat-Core leads to a compiler model that is efficient in terms of both performance and executable code size. With four parallel lanes and matrix registers of 8×4 (32) elements, our results show performance of about 1.6, 2.1, 4.1, and 6.4 FLOPs per clock cycle on scalar-vector multiplication, SAXPY, vector-matrix multiplication, and matrix-matrix multiplication, respectively.
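The decoupling described in the abstract can be pictured with a small functional sketch of SAXPY (y = a·x + y). This is not Mat-Core's implementation, and it models no timing: the function names, the flat toy memory layout, and the round-robin assignment of elements to lanes are all assumptions made for illustration. It only shows an address-generation side filling a data queue that a lane-organized data-computation side then drains.

```python
from collections import deque

NUM_LANES = 4  # the abstract reports results for four parallel lanes


def address_generation(memory, base_x, base_y, n, queue):
    """Access side (hypothetical): walk the operand addresses and
    push (index, x, y) tuples into the data queue."""
    for i in range(n):
        queue.append((i, memory[base_x + i], memory[base_y + i]))


def data_computation(a, n, queue):
    """Execute side (hypothetical): pop operands from the queue and compute
    a*x + y. Element i is attributed to lane i % NUM_LANES (an assumed,
    round-robin mapping)."""
    result = [0.0] * n
    lane_work = [0] * NUM_LANES
    while queue:
        i, x, y = queue.popleft()
        result[i] = a * x + y
        lane_work[i % NUM_LANES] += 1
    return result, lane_work


def saxpy_decoupled(a, x, y):
    """Run SAXPY through the two decoupled stages, linked only by the queue."""
    n = len(x)
    memory = list(x) + list(y)  # toy flat memory: x at address 0, y at address n
    queue = deque()
    address_generation(memory, 0, n, n, queue)
    return data_computation(a, n, queue)


def saxpy_flops(n):
    """SAXPY performs one multiply and one add per element."""
    return 2 * n
```

Calling `saxpy_decoupled(2.0, x, y)` on two equal-length lists returns the updated vector together with a per-lane element count. In real hardware the two stages would run concurrently so that memory accesses overlap with computation; here they run one after the other, which keeps the sketch simple but hides that effect.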