...
Journal: Parallel Processing Letters

SYSTEMC IMPLEMENTATION AND PERFORMANCE EVALUATION OF A DECOUPLED GENERAL-PURPOSE MATRIX PROCESSOR


Abstract

Technological advances in IC manufacturing make it possible to integrate more and more functionality into a single chip. Modern processors have nearly one billion transistors on a single chip. As systems grow more complex, designs have to be modeled at a high level of abstraction before they are partitioned into hardware and software components for final implementation. This paper explains in detail the implementation and performance evaluation of a matrix processor called Mat-Core in SystemC (a system-level modeling language). Mat-Core is a research processor that aims to exploit the increasing number of transistors per IC to improve the performance of a wide range of applications. It extends a general-purpose scalar processor with a matrix unit. To hide memory latency, the matrix unit is decoupled into two components, address generation and data computation, which communicate through data queues. As in vector architectures, the data computation unit is organized in parallel lanes. On these lanes, however, Mat-Core can execute matrix-scalar, matrix-vector, and matrix-matrix instructions in addition to vector-scalar and vector-vector instructions. To control the execution of vector/matrix instructions on the matrix core, this paper extends the well-known scoreboard technique. The performance of Mat-Core is then evaluated on vector and matrix kernels. The evaluated configuration is a four-lane Mat-Core with 4 × 4 matrix registers (16 elements each), a queue size of 10, a start-up time of 6 clock cycles, and a memory latency of 10 clock cycles. It achieves about 0.94, 1.3, 2.3, 1.6, 2.3, and 5.5 FLOPs per clock cycle on scalar-vector multiplication, SAXPY, Givens, rank-1 update, vector-matrix multiplication, and matrix-matrix multiplication, respectively.
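The latency-hiding effect of decoupling address generation from data computation can be illustrated with a toy cycle-count model. The sketch below is not the paper's SystemC code. The names `Config`, `coupled_cycles`, and `decoupled_cycles` are hypothetical, and the model assumes one load issued per cycle, one element consumed per cycle, a single lane, and no start-up time. It shows only how a bounded data queue lets loads run ahead of computation.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical parameters of the toy model (not taken from the paper's code).
struct Config {
    std::size_t queue_capacity;  // max elements issued but not yet consumed
    std::size_t memory_latency;  // cycles from load issue to data available
};

// Coupled baseline (assumption): every element pays the full memory
// latency plus one compute cycle, with no overlap between elements.
std::size_t coupled_cycles(std::size_t n, const Config& c) {
    return n * (c.memory_latency + 1);
}

// Decoupled model: the address generator issues one load per cycle while
// the data queue has a free slot; the compute unit consumes the head of
// the queue as soon as its data has arrived.
std::size_t decoupled_cycles(std::size_t n, const Config& c) {
    assert(c.queue_capacity > 0);
    std::deque<std::size_t> queue;  // arrival cycle of each outstanding element
    std::size_t issued = 0, consumed = 0, cycle = 0;
    while (consumed < n) {
        // Data computation side: consume one element if it is ready.
        if (!queue.empty() && queue.front() <= cycle) {
            queue.pop_front();
            ++consumed;
        }
        // Address generation side: issue the next load if there is room.
        if (issued < n && queue.size() < c.queue_capacity) {
            queue.push_back(cycle + c.memory_latency);
            ++issued;
        }
        ++cycle;
    }
    return cycle;
}
```

With a queue capacity of 10 and a memory latency of 10 (values borrowed from the configuration in the abstract), `decoupled_cycles(16, Config{10, 10})` returns 26 in this toy model, while `coupled_cycles(16, Config{10, 10})` returns 176. These are properties of the sketch only and are not results from the paper.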
