SYSTEMC IMPLEMENTATION AND PERFORMANCE EVALUATION OF A DECOUPLED GENERAL-PURPOSE MATRIX PROCESSOR

MOSTAFA I. SOLIMAN; ABDULMAJID F. Al-JUNAID


Abstract

Technological advances in IC manufacturing make it possible to integrate more and more functionality into a single chip. Modern processors have nearly one billion transistors on a single chip. As systems grow more complex, designs have to be modeled at a high level of abstraction before being partitioned into hardware and software components for final implementation. This paper explains in detail the implementation and performance evaluation of a matrix processor called Mat-Core in SystemC, a system-level modeling language. Mat-Core is a research processor that aims to exploit the increasing number of transistors per IC to improve the performance of a wide range of applications. It extends a general-purpose scalar processor with a matrix unit. To hide memory latency, the matrix unit is decoupled into two components, address generation and data computation, which communicate through data queues. As in vector architectures, the data computation unit is organized in parallel lanes. On these lanes, however, Mat-Core can execute matrix-scalar, matrix-vector, and matrix-matrix instructions in addition to vector-scalar and vector-vector instructions. To control the execution of vector/matrix instructions on the matrix core, this paper extends the well-known scoreboard technique. Furthermore, the performance of Mat-Core is evaluated on vector and matrix kernels. Our results show that a four-lane Mat-Core with 4 × 4 matrix registers (16 elements each), a queue size of 10, a start-up time of 6 clock cycles, and a memory latency of 10 clock cycles achieves about 0.94, 1.3, 2.3, 1.6, 2.3, and 5.5 FLOPs per clock cycle on scalar-vector multiplication, SAXPY, Givens, rank-1 update, vector-matrix multiplication, and matrix-matrix multiplication, respectively.
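The decoupling idea in the abstract (an address generation unit that runs ahead of the data computation unit, with a data queue between them) can be illustrated with a toy cycle-level model. The sketch below is not the paper's SystemC model: the function name `simulate_decoupled`, the one-load-per-cycle issue rate, the one-element-per-cycle consumption rate, and the fixed latency are all assumptions made for illustration. It ignores lanes, start-up time, and the scoreboard, so it does not reproduce the reported FLOPs figures. The defaults only echo the queue size and memory latency quoted in the abstract.

```python
from collections import deque


def simulate_decoupled(n_elements, mem_latency=10, queue_size=10):
    """Return the cycle count of a toy decoupled access/execute pipeline.

    Hypothetical model: the address generator issues at most one load per
    cycle, each load arrives mem_latency cycles later, and the compute unit
    consumes at most one queued element per cycle.
    """
    in_flight = deque()  # arrival cycles of loads issued but not yet delivered
    queue = deque()      # data queue between address generation and compute
    issued = 0
    consumed = 0
    cycle = 0
    while consumed < n_elements:
        cycle += 1
        # Memory delivers every load whose latency has elapsed.
        while in_flight and in_flight[0] <= cycle:
            in_flight.popleft()
            queue.append(1)
        # The compute unit consumes one element if data is available.
        if queue:
            queue.popleft()
            consumed += 1
        # The address generator issues a load only if a queue slot is free,
        # counting loads still in flight as reserved slots.
        if issued < n_elements and len(queue) + len(in_flight) < queue_size:
            in_flight.append(cycle + mem_latency)
            issued += 1
    return cycle


if __name__ == "__main__":
    # 16 elements corresponds to one 4 x 4 matrix register.
    for depth in (1, 5, 10):
        print(depth, simulate_decoupled(16, mem_latency=10, queue_size=depth))
```

In this model, a queue at least as deep as the memory latency lets the compute unit receive one element per cycle once the first load has arrived, so the latency is paid only once. A queue of depth 1 stops the address generator from running ahead, and the latency is paid for every element.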


Bibliographic details

Source: Parallel Processing Letters, 2010, Issue 2, 19 pages
Authors: MOSTAFA I. SOLIMAN; ABDULMAJID F. Al-JUNAID
Original format: PDF
Language: English
CLC classification: Computer software
Keywords: High performance computing; Multi-level ISA; Scoreboarding; SystemC implementation; Vector/matrix processing; Performance evaluation


