Journal of Circuits, Systems and Computers
VLIW DSP-Based Low-Level Instruction Scheme of Givens QR Decomposition for Real-Time Processing

Abstract

QR decomposition (QRD) is one of the most widely used numerical linear algebra (NLA) kernels in signal processing applications, and its implementation has a considerable impact on system performance. As processor architectures continue to gain ground in the high-performance computing world, QRD algorithms have to be redesigned to take advantage of the architectural features of these new processors. However, on some processor architectures, such as very long instruction word (VLIW), compiler efficiency is not enough to make effective use of the available computational resources. This paper presents an efficient and optimized approach to implementing Givens QRD on a low-power platform based on the VLIW architecture. To overcome the compiler's limited ability to parallelize most of the Givens arithmetic operations, we propose a low-level instruction scheme that could maximize the parallelism rate and minimize clock cycles. The key contributions of this work are as follows: (i) A new fast, parallel design of the Givens algorithm based on the VLIW features (i.e., instruction-level parallelism (ILP) and data-level parallelism (DLP)), including the cache memory properties. (ii) An efficient data management approach to avoid cache misses and memory bank conflicts. Two DSP platforms, C6678 and AK2H12, were used as targets for implementation. The introduced parallel QR implementation method achieves, on average, more than 12x and 6x speedups over the standard algorithm version and the optimized QR routine implementations, respectively. Compared to the state of the art, the proposed scheme implementation is at least 3.65 and 2.5 times faster than the recent CPU and DSP implementations, respectively.
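For readers unfamiliar with the kernel, the following is a minimal sketch of a textbook, scalar Givens QR decomposition in plain Python. It is not the paper's low-level VLIW instruction scheme, which the abstract does not spell out; the function names `givens_rotation` and `givens_qr` are hypothetical and chosen only for illustration. It may help to see where the parallelism could come from: each rotation updates two rows, and the per-column updates inside a rotation are independent of one another.

```python
import math


def givens_rotation(a, b):
    """Return (c, s) so that [[c, s], [-s, c]] applied to (a, b) gives (r, 0)."""
    if b == 0.0:
        return 1.0, 0.0  # nothing to eliminate
    r = math.hypot(a, b)
    return a / r, b / r


def givens_qr(A):
    """Factor an m x n matrix (list of rows) as A = Q R with Givens rotations.

    Textbook baseline only; assumed names, not the authors' implementation.
    """
    m, n = len(A), len(A[0])
    R = [[float(x) for x in row] for row in A]
    # Qt accumulates the product of all rotations, so Qt * A = R.
    Qt = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]

    for j in range(n):
        # Zero column j from the bottom row up to the row just below the diagonal.
        for i in range(m - 1, j, -1):
            c, s = givens_rotation(R[i - 1][j], R[i][j])
            # Rotate rows i-1 and i of R. Each column k is independent,
            # which is the kind of work that can be issued in parallel.
            for k in range(n):
                upper, lower = R[i - 1][k], R[i][k]
                R[i - 1][k] = c * upper + s * lower
                R[i][k] = -s * upper + c * lower
            # Apply the same rotation to the accumulated Qt.
            for k in range(m):
                upper, lower = Qt[i - 1][k], Qt[i][k]
                Qt[i - 1][k] = c * upper + s * lower
                Qt[i][k] = -s * upper + c * lower
            R[i][j] = 0.0  # clear rounding residue in the eliminated entry

    # Q is the transpose of the accumulated rotations.
    Q = [[Qt[j][i] for j in range(m)] for i in range(m)]
    return Q, R
```

Calling `givens_qr([[6, 5, 0], [5, 1, 4], [0, 4, 3]])` returns an orthogonal `Q` and an upper-triangular `R` whose product reproduces the input up to floating-point rounding.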