Published in: IEEE Transactions on Signal Processing

Memory-Efficient Modular VLSI Architecture for High-Throughput and Low-Latency Implementation of Multilevel Lifting 2-D DWT


Abstract

In this paper, we present a modular and pipelined architecture for lifting-based multilevel 2-D DWT that uses neither a line-buffer nor a frame-buffer. The overall area-delay product is reduced in the proposed design by appropriate partitioning and scheduling of the computation of the individual decomposition levels. The processing for different levels is performed by a cascaded pipeline structure to maximize the hardware utilization efficiency (HUE). Moreover, the proposed structure is scalable for high-throughput and area-constrained implementation. We have removed all the redundancies resulting from decimated wavelet filtering to maximize the HUE. The proposed design involves $L$ pyramid algorithm (PA) units and one recursive pyramid algorithm (RPA) unit, where $R=N/P$, $L=\lceil \log_{4} P \rceil$, $P$ is the input block size, and $M$ and $N$ are, respectively, the height and width of the image. The entire multilevel DWT is computed by the proposed structure in $MR$ cycles. The proposed structure has $O(8R\times 2^{L})$ cycles of output latency, which is very small compared to the latency of the existing structures. Interestingly, the proposed structure does not require any line-buffer or frame-buffer, unlike the existing folded structures, which require a line-buffer of size $O(N)$ and a frame-buffer of size $O(M/2\times N/2)$ for multilevel 2-D computation. Instead of those buffers, the proposed structure involves only local registers and RAM of size $O(N)$. The saving of line-buffer and frame-buffer achieved by the proposed design is an important advantage, since the image size could very often be as large as 512 $\times$ 512. From the simulation results, we find that the proposed scalable structure offers a better slice-delay-product (SDP) for higher-throughput implementation, since the on-chip memory of this structure remains almost unchanged with input block size. It has 17% less SDP than the best of the corresponding existing structures on average, for different input-block sizes and image sizes. It involves 1.92 times more transistors, but offers 12.2 times higher throughput and consumes 52% less power per output (PPO) compared with the other structure, on average for different input sizes.
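
The parameter relations stated in the abstract can be evaluated for a concrete image size. The following is a minimal sketch, not code from the paper: the names `DwtParams`, `ceil_log4`, and `dwt_params` are hypothetical, and the requirement that $N$ be divisible by $P$ is an assumption made here so that $R=N/P$ is an integer. The latency figure is the argument of the $O(\cdot)$ bound, so it indicates an order of magnitude rather than an exact cycle count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DwtParams:
    """Quantities named in the abstract (the field names are illustrative)."""
    R: int              # R = N / P
    L: int              # L = ceil(log_4 P), number of PA units
    total_cycles: int   # M * R cycles for the whole multilevel DWT
    latency_order: int  # 8 * R * 2**L, the argument of the O(.) latency bound


def ceil_log4(p: int) -> int:
    """Return ceil(log_4 p) using integer arithmetic to avoid float rounding."""
    if p < 1:
        raise ValueError("block size must be a positive integer")
    level = 0
    while 4 ** level < p:
        level += 1
    return level


def dwt_params(M: int, N: int, P: int) -> DwtParams:
    """Evaluate R, L, M*R and 8*R*2^L for an M x N image and block size P."""
    # Assumption (not stated in the abstract): N is a multiple of P.
    if N % P != 0:
        raise ValueError("this sketch assumes N is divisible by P")
    R = N // P
    L = ceil_log4(P)
    return DwtParams(R=R, L=L, total_cycles=M * R, latency_order=8 * R * 2 ** L)


if __name__ == "__main__":
    # 512 x 512 image (the size mentioned in the abstract), block size 16.
    print(dwt_params(512, 512, 16))
```

For a 512 $\times$ 512 image with $P=16$, this gives $R=32$, $L=2$, $MR=16384$ cycles, and a latency bound argument of $8R\times 2^{L}=1024$.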
