Evaluation of parallel H.264 decoding strategies for the Cell Broadband Engine

Abstract

How to develop efficient and scalable parallel applications is the key challenge for emerging many-core architectures. We investigate this question by implementing and comparing two parallel H.264 decoders on the Cell architecture. It is expected that future many-cores will use a Cell-like local store memory hierarchy, rather than a non-scalable shared memory. The two implemented parallel algorithms, the Task Pool (TP) and the novel Ring-Line (RL) approach, both exploit macroblock-level parallelism. The TP implementation follows the master-slave paradigm and is highly dynamic, so that in theory perfect load balancing can be achieved. The RL approach is distributed and more predictable in the sense that the mapping of macroblocks to processing elements is fixed. This allows data locality to be better exploited, communication to be overlapped with computation, and communication and synchronization overhead to be reduced. While TP is more scalable in theory, the actual scalability favors RL. Using 16 SPEs, RL obtains a scalability of 12x, while TP achieves only 10.3x. More importantly, the absolute performance of RL is much higher. Using 16 SPEs, RL achieves a throughput of 139.6 frames per second (fps) while TP achieves only 76.6 fps. A large part of the additional performance advantage is due to hiding the memory latency. From the results we conclude that in order to fully leverage the performance of future many-cores, a centralized master should be avoided and the mapping of tasks to cores should be predictable, so that the memory latency can be hidden.
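To make the contrast between the two strategies more concrete, the sketch below simulates a fixed row-to-worker mapping in the spirit of the Ring-Line approach, using POSIX threads and C11 atomics on an ordinary CPU rather than Cell SPEs. It is a minimal illustration, not the paper's implementation: the names MB_WIDTH, MB_HEIGHT, NUM_WORKERS and decode_mb() are hypothetical placeholders, and the dependency check assumes the usual H.264 2D-wave constraint that macroblock (x, y) needs macroblock (x+1, y-1) from the row above.

```c
/*
 * Conceptual sketch (not the Cell/SPE code from the paper): each
 * worker statically owns every NUM_WORKERS-th macroblock row, so a
 * worker only synchronizes with its ring neighbour that owns the
 * row directly above. All identifiers below are illustrative.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define MB_WIDTH    120   /* macroblocks per row (e.g. 1920/16) */
#define MB_HEIGHT    68   /* macroblock rows     (e.g. 1088/16) */
#define NUM_WORKERS   4   /* stand-ins for SPEs                 */

/* progress[y] = number of macroblocks already decoded in row y */
static atomic_int progress[MB_HEIGHT];

static void decode_mb(int x, int y)
{
    /* placeholder for the real macroblock decode kernel */
    (void)x; (void)y;
}

static void *worker(void *arg)
{
    int id = (int)(long)arg;

    /* Static mapping: worker `id` owns rows id, id+NUM_WORKERS, ... */
    for (int y = id; y < MB_HEIGHT; y += NUM_WORKERS) {
        for (int x = 0; x < MB_WIDTH; x++) {
            /* 2D-wave dependency: MB(x, y) needs MB(x+1, y-1),
             * which belongs to the neighbouring worker in the ring. */
            if (y > 0) {
                int need = (x + 2 <= MB_WIDTH) ? x + 2 : MB_WIDTH;
                while (atomic_load(&progress[y - 1]) < need)
                    ;  /* in the real design this wait is overlapped with DMA */
            }
            decode_mb(x, y);
            atomic_fetch_add(&progress[y], 1);
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_WORKERS];

    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&t[i], NULL, worker, (void *)(long)i);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(t[i], NULL);

    printf("decoded %d x %d macroblocks\n", MB_WIDTH, MB_HEIGHT);
    return 0;
}
```

Because the owner of each row is known in advance, a worker can prefetch the reference data for its next macroblocks while waiting on its neighbour, which is the kind of predictability the abstract credits for RL's ability to hide memory latency; a Task Pool variant would instead have a master thread hand out individual macroblock tasks from a shared queue, making such prefetching harder to plan.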
