Journal: International Journal of Parallel Programming

Parallel SIMD CPU and GPU Implementations of Berlekamp-Massey Algorithm and Its Error Correction Application

Abstract

The Berlekamp-Massey algorithm (BMA) finds the shortest linear feedback shift register for a binary input sequence. A wide range of applications, such as cryptography and digital signal processing, use this algorithm. This research proposes novel parallel implementations that exploit the mechanisms offered by heterogeneous CPU and GPU hardware in order to achieve the best possible performance for the BMA. The proposed bitwise implementation of the BMA is almost 35 times faster than state-of-the-art implementations. Further improvement is achieved by using SIMD instructions, which provide data-level parallelism. This new approach can be 4.6 and 35 times faster than a bitwise CPU implementation and state-of-the-art implementations, respectively. In order to achieve the highest possible speedup on a multi-core architecture, a multi-threaded implementation is introduced. By leveraging OpenMP, we were able to obtain a speedup of 10 times on a 12-core server. A GPU device with thousands of processing cores can bring a large speedup over the best CPU implementation. Two other parallel mechanisms offered by the GPU are concurrent kernel execution and streaming. They achieve speedups of 14.5 and 2.2 times compared to the serial CPU and typical CUDA implementations, respectively. In addition, the performance of the OpenMP code with SIMD instructions is compared with the GPU stream implementation. The effectiveness of the proposed method is evaluated in a real-world error correction application, where it achieves a speedup of 6.8 times.
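
To make the idea of a bitwise BMA concrete, the following is a minimal sketch in Python of the textbook Berlekamp-Massey algorithm over GF(2), with the connection polynomial and the recent input bits packed into integers. It is not the authors' implementation: the function name `berlekamp_massey_bitwise` and the packing convention (bit i of `c` is the coefficient of x^i, bit i of `window` is the input bit i steps back) are assumptions made for illustration. The point it shows is that, once packed, the discrepancy reduces to one AND plus a parity count, and the polynomial update to one shift plus one XOR, which is the kind of word-level operation that SIMD instructions can widen further.

```python
def berlekamp_massey_bitwise(bits):
    """Return (L, c) for a binary sequence.

    L is the length of the shortest LFSR that generates `bits`;
    c packs the connection polynomial C(x), bit i = coefficient of x^i.
    Illustrative sketch only (names and packing are assumptions).
    """
    c = 1        # current connection polynomial, C(x) = 1
    b = 1        # copy of C(x) saved at the last length change
    L = 0        # current LFSR length
    m = -1       # index at which L last changed
    window = 0   # bit i holds bits[N - i] for the current index N

    for N, s in enumerate(bits):
        # Shift the newest bit into the reversed history window.
        window = (window << 1) | (s & 1)

        # Discrepancy: parity of sum_i c_i * bits[N - i], done as AND + popcount.
        d = bin(c & window).count("1") & 1

        if d:
            t = c
            # C(x) <- C(x) + x^(N - m) * B(x): one shift and one XOR.
            c ^= b << (N - m)
            if 2 * L <= N:
                # The register has to grow; remember the old polynomial.
                L = N + 1 - L
                m = N
                b = t
    return L, c


if __name__ == "__main__":
    # Sequence produced by the recurrence s[n] = s[n-3] XOR s[n-4].
    seq = [1, 0, 0, 0]
    for n in range(4, 20):
        seq.append(seq[n - 3] ^ seq[n - 4])
    L, c = berlekamp_massey_bitwise(seq)
    print(L, bin(c))
```

In this sketch the Python integers stand in for fixed-width machine words; a C or CUDA version would have to handle polynomials longer than one word explicitly, which is where most of the engineering effort described in the abstract would lie.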