Journal: Computer Architecture News

SIMD Defragmenter: Efficient ILP Realization on Data-parallel Architectures

Abstract

Single-instruction multiple-data (SIMD) accelerators provide an energy-efficient platform for scaling the performance of mobile systems while retaining post-programmability. The central challenge is translating the parallel resources of the SIMD hardware into real application performance. In scientific applications, automatic vectorization techniques have proven quite effective at extracting large amounts of data-level parallelism (DLP). However, vectorization is often much less effective for media applications because of low-trip-count loops, complex control flow, and non-uniform execution behavior. As a result, SIMD lanes remain idle for lack of DLP. To attack this problem, this paper proposes a new vectorization pass, called SIMD Defragmenter, that uncovers hidden DLP lurking below the surface in the form of instruction-level parallelism (ILP). The difficulty is managing the data packing/unpacking overhead, which can easily exceed the benefits gained through SIMD execution. The SIMD Defragmenter overcomes this problem by identifying groups of compatible instructions (subgraphs) that can be executed in parallel across the SIMD lanes. Because SIMDization is done in bulk at the subgraph level, packing/unpacking overhead is minimized. On a 16-lane SIMD processor, experimental results show that SIMD defragmentation achieves a mean 1.6x speedup over traditional loop vectorization and a 31% gain over prior research approaches for converting ILP to DLP.
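
The abstract's point about packing/unpacking overhead can be illustrated with a small sketch. This is not the paper's algorithm; it is a toy cost model under stated assumptions: two hypothetical three-operation chains with matching opcodes stand in for "compatible subgraphs", and any value that crosses the boundary of a SIMD region is assumed to cost one pack (scalar to vector lane) or one unpack (vector lane to scalar). All names (`GRAPH`, `transfer_counts`, the region helpers) are invented for illustration.

```python
# Toy dataflow graph: op name -> (opcode, input names).
# Names that are not ops (x0, y0, ...) are scalar live-in values.
GRAPH = {
    "a0": ("mul", ["x0", "x1"]),
    "a1": ("add", ["a0", "x2"]),
    "a2": ("shl", ["a1", "x3"]),
    "b0": ("mul", ["y0", "y1"]),
    "b1": ("add", ["b0", "y2"]),
    "b2": ("shl", ["b1", "y3"]),
}
LIVE_OUT = {"a2", "b2"}  # results read by scalar code afterwards
CHAIN_A = ["a0", "a1", "a2"]
CHAIN_B = ["b0", "b1", "b2"]


def compatible(graph, chain_a, chain_b):
    """Assumed compatibility test: same length, pairwise equal opcodes."""
    return len(chain_a) == len(chain_b) and all(
        graph[a][0] == graph[b][0] for a, b in zip(chain_a, chain_b)
    )


def per_instruction_regions(chain_a, chain_b):
    """Each pair of matching ops forms its own SIMD region."""
    return {op: i for i, pair in enumerate(zip(chain_a, chain_b)) for op in pair}


def subgraph_region(chain_a, chain_b):
    """Both chains are SIMDized together as a single region."""
    return {op: 0 for op in chain_a + chain_b}


def transfer_counts(graph, live_out, region_of):
    """Return (packs, unpacks) under the toy cost model.

    A pack is counted for every operand that enters a region from outside.
    An unpack is counted once per vectorized op whose result leaves its region.
    """
    packs = 0
    unpacked = set()
    for op, (_opcode, inputs) in graph.items():
        if op not in region_of:
            # Scalar consumer: vector-resident inputs must be unpacked.
            unpacked.update(src for src in inputs if src in region_of)
            continue
        for src in inputs:
            if region_of.get(src) != region_of[op]:
                packs += 1
                if src in region_of:
                    unpacked.add(src)
    unpacked.update(op for op in live_out if op in region_of)
    return packs, len(unpacked)


assert compatible(GRAPH, CHAIN_A, CHAIN_B)
per_op = transfer_counts(GRAPH, LIVE_OUT, per_instruction_regions(CHAIN_A, CHAIN_B))
bulk = transfer_counts(GRAPH, LIVE_OUT, subgraph_region(CHAIN_A, CHAIN_B))
print("per-instruction (packs, unpacks):", per_op)  # (12, 6)
print("whole-subgraph  (packs, unpacks):", bulk)    # (8, 2)
```

In this model the intermediate values `a0`, `a1`, `b0`, and `b1` stay in their lanes when the whole subgraph is one region, so only the live-in operands and the two final results cross the scalar/vector boundary. The pessimistic assumption that per-instruction grouping always round-trips through scalars exaggerates the gap; real compilers can often keep such values in vector registers.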
