Journal: Concurrency and Computation: Practice and Experience

Efficient parallel implementation of three-point Viterbi decoding algorithm on CPU, GPU, and FPGA


Abstract

In wireless communication, the Viterbi decoding algorithm (VDA) is one of the most popular channel decoding algorithms and is widely used in WLAN, WiMAX, or 3G communications. However, the throughput of a Viterbi decoder is constrained by the convolutional characteristic. Recently, the three-point VDA (TVDA) was proposed to solve this problem. In TVDA, the whole procedure can be divided into three phases: the forward, trace-back, and decoding phases. In this paper, we analyze the parallelism of TVDA and propose parallel TVDA on the multi-core CPU, graphics processing unit (GPU), and field programmable gate array (FPGA). We demonstrate approaches that fully exploit its performance potential on CPU, GPU, and FPGA computing platforms. For CPU platforms, we apply two optimization methods, single instruction multiple data (SIMD) and multithreading, to gain over 145× speedup over the naive CPU version on a quad-core CPU platform. For GPU platforms, we propose the combination of cached memory optimization, coalesced global memory accesses, a codeword packing scheme, and asynchronous data transition, achieving a throughput of 404.65 Mbps, a 12× speedup over the initial GPU version on an NVIDIA GeForce GTX580 card, and a 7× speedup over an Intel quad-core i5-2300 CPU from the same manufacturing year, with both platforms fully optimized. In addition, for FPGA platforms, we customize a radix-4 pipelined architecture for the TVDA in a 45-nm FPGA chip from Xilinx (XC6VLX760). At a 209.15-MHz clock rate, it achieves a throughput of 418.30 Mbps. Finally, we also discuss the performance evaluation and efficiency comparison of different flexible architectures for real-time Viterbi decoding in terms of decoding throughput, power consumption, optimization schemes, programming costs, and price costs.
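For readers unfamiliar with the forward and trace-back phases named in the abstract, the following is a minimal sketch of a textbook hard-decision Viterbi decoder. It is not the paper's TVDA and not the authors' code. The rate-1/2, constraint-length-3 code with generators (7, 5) octal, and all function names, are assumptions chosen only for illustration. The last helper checks the reported FPGA figures under the assumption that a radix-4 design produces two decoded bits per clock cycle, which the abstract does not state explicitly.

```python
# Textbook hard-decision Viterbi decoder (illustrative; NOT the paper's TVDA).
# Assumed code: rate 1/2, constraint length K = 3, generators (7, 5) octal.

G = (0b111, 0b101)          # assumed generator polynomials
K = 3                       # assumed constraint length
N_STATES = 1 << (K - 1)     # 4 trellis states


def parity(x):
    """Return the XOR of all bits of x."""
    return bin(x).count("1") & 1


def encode(bits):
    """Convolutionally encode a list of 0/1 bits, starting from state 0."""
    state, out = 0, []
    for b in bits:
        reg = (b << (K - 1)) | state          # newest bit sits in the MSB
        out.extend(parity(reg & g) for g in G)
        state = reg >> 1
    return out


def viterbi_decode(received, n_bits):
    """Decode n_bits information bits from 2 * n_bits received code bits."""
    inf = float("inf")
    metrics = [0] + [inf] * (N_STATES - 1)    # encoder is known to start in state 0
    history = []                              # survivor predecessor of each state, per step

    # Forward phase: add-compare-select over every trellis stage.
    for t in range(n_bits):
        symbol = received[2 * t:2 * t + 2]
        new_metrics = [inf] * N_STATES
        prev = [0] * N_STATES
        for s in range(N_STATES):
            if metrics[s] == inf:
                continue
            for b in (0, 1):
                reg = (b << (K - 1)) | s
                expected = [parity(reg & g) for g in G]
                nxt = reg >> 1
                m = metrics[s] + sum(e != x for e, x in zip(expected, symbol))
                if m < new_metrics[nxt]:
                    new_metrics[nxt] = m
                    prev[nxt] = s
        metrics = new_metrics
        history.append(prev)

    # Trace-back phase: follow survivors backwards from the best final state.
    state = min(range(N_STATES), key=metrics.__getitem__)
    decoded = []
    for prev in reversed(history):
        decoded.append(state >> (K - 2))      # the input bit is the MSB of the state reached
        state = prev[state]
    decoded.reverse()
    return decoded


def throughput_mbps(clock_mhz, bits_per_cycle):
    """Throughput of a fully pipelined decoder that emits bits_per_cycle bits each clock."""
    return clock_mhz * bits_per_cycle
```

A possible usage: encode a short message that ends in two zero tail bits, flip one code bit, and call `viterbi_decode(noisy, len(message))` to recover the message. Under the two-bits-per-cycle assumption, `throughput_mbps(209.15, 2)` matches the 418.30 Mbps reported for the FPGA design.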

Bibliographic details

  • Authors

    Rongchun Li; Yong Dou; Dan Zou;

  • Affiliations

    National Laboratory for Parallel and Distribution Processing, National University of Defense Technology, Changsha, China (all three authors)

  • Original format: PDF
  • Language: English
  • Keywords

    Viterbi; SDR; SSE; OpenMP; GPU; CUDA; FPGA

