
A QHD-capable parallel H.264 decoder



Abstract

Each new generation of video coding demands higher performance, which makes video coding a natural candidate for many-core processors. Fully parallelizing H.264, which is the most advanced video coding standard, has proven difficult because of the standard's complexity. This paper presents a parallel implementation of a complete H.264 decoder. Our parallelization strategy exploits both function-level and data-level parallelism. Function-level parallelism is used to pipeline the H.264 decoding stages. Data-level parallelism is exploited within the two most time-consuming stages: entropy decoding and macroblock decoding. The strategy has been implemented and optimized on three platforms with very different memory architectures: an 8-core SMP, a 64-core cc-NUMA, and an 18-core Cell platform. Evaluations were performed using 4kx2k QHD sequences. On the SMP platform, a maximum speedup of 4.5x is achieved. The SMP implementation is reasonably performance-portable, achieving a speedup of 26.6x on the cc-NUMA system. However, obtaining the highest performance (a speedup of 33.4x and a throughput of 200 QHD frames per second) requires several cc-NUMA-specific optimizations, such as optimizing page placement and statically assigning threads to cores. Finally, on the Cell platform, a near-ideal speedup of 16.5x is achieved by completely hiding the communication latency.
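The function-level pipelining described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the stage names, the `run_pipeline` helper, and the queue depth are all hypothetical, and the stages only tag each frame rather than decode anything. Python threads also do not give real speedup for CPU-bound work, so the sketch shows only the structure: one thread per decoding stage, connected by bounded FIFO queues so that different frames occupy different stages at the same time.

```python
import queue
import threading

SENTINEL = object()  # end-of-stream marker passed down the pipeline


def stage(work, inbox, outbox):
    """Run one pipeline stage: apply `work` to each item until the sentinel arrives."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            return
        outbox.put(work(item))


def run_pipeline(frames, stages, depth=4):
    """Push `frames` through `stages` (a list of callables), one thread per stage.

    Queues between stages are bounded by `depth` (an assumed value); the final
    queue is unbounded so the stage threads can never block on the collector.
    """
    queues = [queue.Queue(maxsize=depth) for _ in stages] + [queue.Queue()]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()

    for frame in frames:
        queues[0].put(frame)
    queues[0].put(SENTINEL)

    results = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            break
        results.append(item)

    for t in threads:
        t.join()
    return results


# Hypothetical placeholder stages: each one only records that it ran.
def parse(frame):
    return frame + ("parsed",)


def entropy_decode(frame):
    return frame + ("entropy_decoded",)


def decode_macroblocks(frame):
    return frame + ("macroblocks_decoded",)


if __name__ == "__main__":
    frames = [(i,) for i in range(3)]
    for decoded in run_pipeline(frames, [parse, entropy_decode, decode_macroblocks]):
        print(decoded)
```

Because each stage is a single thread reading from a FIFO queue, frames leave the pipeline in the order they entered. The data-level parallelism inside the entropy decoding and macroblock decoding stages, which the abstract also mentions, is not modeled here.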

Bibliographic details

  • Authors

    Chi Chi Ching; Juurlink Ben

  • Author affiliation
  • Year: 2011
  • Total pages
  • Original format: PDF
  • Language: en
  • CLC classification
