Journal of Parallel and Distributed Computing

Design, implementation, and evaluation of a low-complexity vector-core for executing scalar/vector instructions


Abstract

This paper proposes a low-complexity vector core, called LcVc, that executes both scalar and vector instructions on the same execution datapath. A unified register file in the decode stage stores both scalar operands and vector elements. The execution stage accepts a new set of operands each cycle and produces a new result. Rather than issuing a vector instruction (1-D operations) as a whole, LcVc issues each vector operation sequentially using the existing scalar issue hardware. In the first implementation of LcVc, all register loads and stores go through the data cache in the memory access stage at a rate of one element per clock cycle. The complete design of the proposed LcVc processor is implemented in VHDL, targeting the Xilinx Spartan 3E FPGA (xc3s1600e-4-fg320 device). Implementing LcVc requires 1778 slices in total, with 538 slice flip-flops and 3706 4-input LUTs: 1914 for logic and 1792 for RAMs. Moreover, our performance evaluation results show that the speedups of executing vector addition, vector scaling, SAXPY, and matrix-matrix multiplication on LcVc over scalar execution are 2.3, 2.5, 1.9, and 3, respectively. The hardware required to support the enhanced vector capability is insignificant (5%), which reduces the area per core and increases the number of cores available in a given chip area.
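The sequential issue scheme described in the abstract can be illustrated with a toy software model. This is a minimal sketch, not the authors' design: the `VectorOp` fields, the register layout, the `src2_is_scalar` flag, and the one-element-per-cycle count are all assumptions made for illustration. It does not model the pipeline, the data cache, or the instruction fetch and loop overheads, so it does not reproduce the reported speedups.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Iterator, List, Tuple


@dataclass
class VectorOp:
    """Hypothetical 1-D vector instruction over a unified register file."""
    fn: Callable[[float, float], float]
    dst: int                      # base register of the destination vector
    src1: int                     # base register of the first source vector
    src2: int                     # base register of the second source
    length: int                   # number of elements
    src2_is_scalar: bool = False  # assumption: src2 may be one scalar register


def expand(op: VectorOp) -> Iterator[Tuple[int, int, int]]:
    """Yield one (dst, src1, src2) register triple per element, in order."""
    for i in range(op.length):
        b = op.src2 if op.src2_is_scalar else op.src2 + i
        yield op.dst + i, op.src1 + i, b


def execute(regs: List[float], op: VectorOp) -> int:
    """Issue the element operations one at a time; return the modeled cycles."""
    cycles = 0
    for d, a, b in expand(op):
        regs[d] = op.fn(regs[a], regs[b])  # same datapath as a scalar op
        cycles += 1
    return cycles


def saxpy_demo() -> Tuple[List[float], int]:
    """Compute y = a*x + y with two vector ops on a 16-entry register file."""
    regs = [0.0] * 16
    regs[0] = 2.0                            # scalar a
    regs[4:8] = [1.0, 2.0, 3.0, 4.0]         # vector x
    regs[8:12] = [10.0, 20.0, 30.0, 40.0]    # vector y
    scale = VectorOp(operator.mul, dst=12, src1=4, src2=0, length=4,
                     src2_is_scalar=True)    # tmp = a * x
    add = VectorOp(operator.add, dst=8, src1=12, src2=8, length=4)  # y = tmp + y
    cycles = execute(regs, scale) + execute(regs, add)
    return regs[8:12], cycles


if __name__ == "__main__":
    print(saxpy_demo())
```

In this model scalars and vector elements share one register list, and a vector instruction costs one modeled cycle per element, which mirrors the abstract's point that no separate vector issue logic or vector datapath is needed.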
