Journal: ACM Transactions on Embedded Computing Systems

Symbolic Multi-Level Loop Mapping of Loop Programs for Massively Parallel Processor Arrays

Abstract

Today's MPSoCs (multiprocessor systems-on-chip) have given rise to massively parallel processor array accelerators that may achieve high computational efficiency by exploiting multiple levels of parallelism and different memory hierarchies. Such parallel processor arrays are well-suited targets, particularly for the acceleration of nested loop programs, due to their regular and massively parallel nature. However, existing loop parallelization techniques are often unable to exploit multiple levels of parallelism and are either I/O-bound or memory-bound. Furthermore, if the number of available processing elements becomes known only at runtime, as in adaptive systems, static approaches fail. In this article, we solve some of these problems by proposing a hybrid compile/runtime multi-level symbolic parallelization technique that is able to (a) exploit multiple levels of parallelism, (b) exploit different memory hierarchies, and (c) match the I/O or memory capabilities of the target architecture for scenarios where the number of available processing elements is known only at runtime. Our proposed technique consists of two compile-time transformations: (a) symbolic hierarchical tiling followed by (b) symbolic multi-level scheduling. The tiling levels scheduled in parallel exploit different levels of parallelism, whereas those scheduled sequentially exploit different memory hierarchies. Furthermore, by tuning the size of the tiles on the individual levels, a tradeoff between the necessary I/O bandwidth and memory is possible, which facilitates obeying resource constraints. The resulting schedules are symbolic with respect to the problem size and tile sizes. Thus, the number of processing elements to map onto does not need to be known at compile time. At runtime, when the number of available processors becomes known, a simple prologue chooses a schedule that is feasible with respect to I/O and memory constraints and latency-optimal for the chosen tile size.
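The idea of hierarchical tiling with tile sizes left as parameters can be illustrated with a minimal sketch. It is not the article's formulation: it assumes a one-dimensional loop of `n` iterations and two tiling levels, and the names `hierarchical_tiles`, `p1`, and `p2` are hypothetical. Each iteration `i` is rewritten as `i = j1 + p1 * (j2 + p2 * j3)`, so the tile sizes can be supplied at runtime rather than fixed when the code is written.

```python
from itertools import product


def hierarchical_tiles(n, p1, p2):
    """Enumerate iterations 0..n-1 of a 1-D loop under two-level tiling.

    Illustrative assumption, not the article's method: every iteration i
    is decomposed as i = j1 + p1 * (j2 + p2 * j3), where
      j1 indexes a point inside a level-1 tile of size p1,
      j2 indexes a level-1 tile inside a level-2 tile (p2 tiles each),
      j3 indexes the level-2 tiles.
    """
    outer = -(-n // (p1 * p2))  # ceil(n / (p1 * p2)) level-2 tiles
    for j3, j2, j1 in product(range(outer), range(p2), range(p1)):
        i = j1 + p1 * (j2 + p2 * j3)
        if i < n:  # skip points of partial tiles beyond the loop bound
            yield (j3, j2, j1), i


if __name__ == "__main__":
    for coords, i in hierarchical_tiles(10, 2, 3):
        print(coords, i)
```

In such a decomposition, one could imagine the `j2` dimension being distributed across processing elements while `j1` and `j3` run sequentially; which levels run in parallel is a scheduling decision that this sketch does not model.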
In summary, our approach determines the set of feasible, latency-optimal symbolic loop schedule candidates at compile time, from which one is dynamically selected at runtime. This approach exploits multiple levels of parallelism, is independent of the problem size of the loop nest, and thereby avoids any expensive recompilation at runtime. This is particularly important for low-cost and memory-scarce embedded MPSoC platforms that cannot afford to host a just-in-time compiler.
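The runtime selection step can be sketched as follows, under stated assumptions. The candidate set, the closed-form latency, memory, and I/O expressions, and all names (`Candidate`, `select_schedule`, `CANDIDATES`) are hypothetical placeholders invented for illustration; the abstract does not give the actual formulas. The sketch only shows the general pattern: candidates are stored as expressions in the problem size and tile size, and a small prologue evaluates them once the runtime parameters are known.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """A schedule candidate kept symbolic in problem size n and tile size t."""
    name: str
    latency: Callable[[int, int], int]  # (n, t) -> latency in cycles
    memory: Callable[[int, int], int]   # (n, t) -> memory words needed
    io: Callable[[int, int], int]       # (n, t) -> I/O words per cycle


def select_schedule(candidates: Sequence[Candidate], n: int, t: int,
                    mem_limit: int, io_limit: int) -> Optional[Candidate]:
    """Return the feasible candidate with the lowest latency, or None."""
    feasible = [c for c in candidates
                if c.memory(n, t) <= mem_limit and c.io(n, t) <= io_limit]
    return min(feasible, key=lambda c: c.latency(n, t), default=None)


# Hypothetical candidates with made-up cost expressions.
CANDIDATES = [
    Candidate("A", latency=lambda n, t: n // t + t,
              memory=lambda n, t: t, io=lambda n, t: 1),
    Candidate("B", latency=lambda n, t: n // t,
              memory=lambda n, t: 2 * t, io=lambda n, t: 2),
]

if __name__ == "__main__":
    chosen = select_schedule(CANDIDATES, n=64, t=8, mem_limit=16, io_limit=2)
    print(chosen.name if chosen else "no feasible schedule")
```

Because the candidates are plain expressions evaluated at runtime, no compiler needs to run on the target; the prologue only filters and compares a fixed list.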