International Conference on Application-specific Systems, Architectures and Processors

Refine and Recycle: A Method to Increase Decompression Parallelism



Abstract

Rapid increases in storage bandwidth, combined with a desire for operating on large datasets interactively, drive the need for improvements in high-bandwidth decompression. Existing designs either process only one token per cycle or process multiple tokens per cycle with low area efficiency and/or low clock frequency. We propose two techniques to achieve high single-decoder throughput at improved efficiency by keeping only a single copy of the history data across multiple BRAMs and operating on each BRAM independently. A first stage efficiently refines the tokens into commands that operate on a single BRAM and steers the commands to the appropriate one. In the second stage, a relaxed execution model is used where each BRAM command executes immediately and those with invalid data are recycled to avoid stalls caused by the read-after-write dependency. We apply these techniques to Snappy decompression and implement a Snappy decompression accelerator on a CAPI2-attached FPGA platform equipped with a Xilinx VU3P FPGA. Experimental results show that our proposed method achieves up to 7.2 GB/s output throughput per decompressor, with each decompressor using 14.2% of the logic and 7% of the BRAM resources of the device. Therefore, a single decompressor can easily keep pace with an NVMe device (PCIe Gen3 x4) on a small FPGA, while a larger device, integrated on a host bridge adapter and instantiating multiple decompressors, can keep pace with the full OpenCAPI 3.0 bandwidth of 25 GB/s.
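To make the two-stage idea in the abstract concrete, below is a minimal, purely illustrative Python sketch of a refine-and-recycle execution model. It assumes a byte-interleaved bank mapping and per-byte commands for simplicity; the names (Command, refine, execute) and all structural choices are assumptions for exposition, not the authors' RTL design, which operates on BRAM lines rather than single bytes.

```python
# Illustrative sketch only: per-byte commands and byte-interleaved bank
# mapping are simplifying assumptions, not the paper's implementation.
from collections import deque
from dataclasses import dataclass
from typing import Optional

NUM_BANKS = 4  # number of independent history BRAM banks (illustrative)

@dataclass
class Command:
    """A refined command that touches exactly one history bank."""
    bank: int            # which bank holds the destination byte
    dst: int             # absolute write position in the output history
    src: Optional[int]   # absolute read position (copies), None for literals
    data: bytes          # literal payload (empty for copies)

def refine(token, out_pos):
    """Stage 1: split a Snappy-style token into single-bank byte commands
    and steer each one to the bank that owns its destination address."""
    kind, payload = token
    cmds = []
    if kind == "literal":
        for i, b in enumerate(payload):
            dst = out_pos + i
            cmds.append(Command(dst % NUM_BANKS, dst, None, bytes([b])))
    else:  # ("copy", (offset, length))
        offset, length = payload
        for i in range(length):
            dst = out_pos + i
            cmds.append(Command(dst % NUM_BANKS, dst, dst - offset, b""))
    return cmds

def execute(commands, history):
    """Stage 2: each bank issues one command per cycle; a copy whose source
    byte is not yet valid is recycled (re-queued) instead of stalling."""
    queues = [deque() for _ in range(NUM_BANKS)]
    for c in commands:
        queues[c.bank].append(c)
    written, cycles = set(), 0
    while any(queues):
        cycles += 1
        valid = written.copy()          # bytes valid at the start of this cycle
        for q in queues:                # banks operate independently
            if not q:
                continue
            c = q.popleft()
            if c.src is None:           # literal: always executes
                history[c.dst] = c.data[0]
                written.add(c.dst)
            elif c.src in valid:        # copy with valid source data
                history[c.dst] = history[c.src]
                written.add(c.dst)
            else:                       # read-after-write hazard: recycle
                q.append(c)
    return cycles

# Example: the literal "ab" followed by an overlapping copy (offset 2, length 6)
# expands to "abababab"; recycled copies resolve over a few extra cycles.
hist = {}
cmds = refine(("literal", b"ab"), 0) + refine(("copy", (2, 6)), 2)
cycles = execute(cmds, hist)
print(cycles, bytes(hist[i] for i in range(8)))  # -> 4 b'abababab'
```

In this toy model the recycle queue plays the role of the paper's relaxed execution: no bank ever waits for another, and only the individual commands whose history bytes are not yet valid are retried, which is what allows many tokens to be in flight per cycle.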
