VLIW DSP-Based Low-Level Instruction Scheme of Givens QR Decomposition for Real-Time Processing

Najoui Mohamed; Bahtat Mounir; Hatim Anas; Belkouch Said; Chabini Noureddine

Abstract

QR decomposition (QRD) is one of the most widely used numerical linear algebra (NLA) kernels in signal processing applications. Its implementation has a considerable impact on system performance. As processor architectures continue to gain ground in the high-performance computing world, QRD algorithms have to be redesigned to take advantage of the architectural features of these new processors. However, in some processor architectures, such as very long instruction word (VLIW), compiler efficiency is not enough to make effective use of the available computational resources. This paper presents an efficient and optimized approach to implementing Givens QRD on a low-power platform based on the VLIW architecture. To overcome the compiler's limits in parallelizing most of the Givens arithmetic operations, we propose a low-level instruction scheme that can maximize the parallelism rate and minimize clock cycles. The key contributions of this work are as follows: (i) A new parallel and fast design of the Givens algorithm based on the VLIW features (i.e., instruction-level parallelism (ILP) and data-level parallelism (DLP)), including the cache memory properties. (ii) An efficient data management approach to avoid cache misses and memory bank conflicts. Two DSP platforms, C6678 and AK2H12, were used as implementation targets. The introduced parallel QR implementation method achieves, on average, more than 12x and 6x speedups over the standard algorithm version and the optimized QR routine implementations, respectively. Compared to the state of the art, the proposed implementation is at least 3.65 and 2.5 times faster than recent CPU and DSP implementations, respectively.
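
For orientation, the baseline that the abstract refers to as Givens QRD can be sketched as below. This is the standard textbook scalar algorithm, not the paper's low-level VLIW instruction scheme, which is not described on this page. The function names `givens` and `givens_qr` and the list-of-rows matrix layout are assumptions made for illustration only.

```python
import math


def givens(a, b):
    """Return (c, s) so that [[c, s], [-s, c]] applied to (a, b) gives (r, 0)."""
    if b == 0.0:
        return 1.0, 0.0  # nothing to eliminate
    r = math.hypot(a, b)
    return a / r, b / r


def givens_qr(A):
    """Textbook Givens QR of an m x n matrix given as a list of rows.

    Returns (Q, R) with Q orthogonal (m x m), R upper triangular (m x n),
    and Q @ R equal to A up to rounding.
    """
    m, n = len(A), len(A[0])
    R = [[float(x) for x in row] for row in A]
    # Qt accumulates the product of all rotations, i.e. Q transposed.
    Qt = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(n):
        # Zero column j from the bottom row up to the row below the diagonal.
        for i in range(m - 1, j, -1):
            c, s = givens(R[i - 1][j], R[i][j])
            # Apply the same rotation to rows i-1 and i of both R and Qt.
            for M in (R, Qt):
                upper, lower = M[i - 1], M[i]
                M[i - 1] = [c * u + s * l for u, l in zip(upper, lower)]
                M[i] = [-s * u + c * l for u, l in zip(upper, lower)]
    Q = [list(col) for col in zip(*Qt)]  # transpose Qt to obtain Q
    return Q, R
```

Each rotation touches only two rows, and the updates to the elements of those rows are independent of one another. That independence is the kind of instruction-level and data-level parallelism the abstract says the proposed scheme exploits.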

Bibliographic details

Source
Journal of Circuits, Systems and Computers | 2017, No. 9 | pp. 1750129.1-1750129.26 | 26 pages
Authors
Najoui Mohamed; Bahtat Mounir; Hatim Anas; Belkouch Said; Chabini Noureddine
Author affiliations

Univ Cadi Ayyad, ENSA Marrakech, LGECOS Lab, Marrakech, Morocco;

Ibn Zohr Univ, ENSA Agadir, Agadir, Morocco;

Univ Cadi Ayyad, ENSA Marrakech, LGECOS Lab, Marrakech, Morocco;

Royal Mil Coll Canada, Dept Elect & Comp Engn, Kingston, ON K7K 7B4, Canada;

Original format: PDF
Text language: English
Keywords
Instruction scheduling; QR decomposition; parallel Givens rotations; ILP; DLP; low-power DSP; VLIW; software pipelining; numerical linear algebra

Similar literature

1. An Embedded Software Scheme for A Real-Time Single-Chip MPEG-2 Encoder System with a VLIW Media Processor Core [J]. Hiroshi Segawa, Yoshinori Matsuura, Satoshi Kumaki. IEICE Transactions on Electronics, 2001, No. 2
2. Instruction scheduling and transformation for a VLIW unified reduced instruction set computer/digital signal processor processor with shared register architecture [J]. Cheng-Yu Lee, Min-Chin Hung, Rong-Guey Chang. Concurrency and computation: practice and experience, 2014, No. 1
3. Givens rotation-based QR decomposition for MIMO systems [J]. Wen Fan, Amir Alimohammad. Communications, IET, 2017, No. 12
4. Efficient Implementation of Givens QR Decomposition on VLIW DSP Architecture for Orthogonal Matching Pursuit Image Reconstruction [C]. Mohamed Najoui, Anas Hatim, Mounir Bahtat. Mediterranean conference on information communication technologies, 2015
5. A low-cost high-speed twin-prefetching DSP-based shared-memory system for real-time image processing applications [D]. Christou, Charalambos Stephanou. 1998
6. D-MSR: A Distributed Network Management Scheme for Real-Time Monitoring and Process Control Applications in Wireless Industrial Automation [O]. Pouria Zand, Arta Dilo, Paul Havinga. 2013
7. Hardware Architectures of the QR-Decomposition Based on a Givens Rotation Technique [O]. Alexey V. Sokolovskiy, Evgeny A. Veisov, Valery N. Tyapkin. 2019
