Published in: 2012 IEEE 26th International Parallel and Distributed Processing Symposium

A Comprehensive Study of Task Coalescing for Selecting Parallelism Granularity in a Two-Stage Bidiagonal Reduction



Abstract

We present new high performance numerical kernels combined with advanced optimization techniques that significantly increase the performance of parallel bidiagonal reduction. Our approach is based on developing efficient fine-grained computational tasks as well as reducing overheads associated with their high-level scheduling during the so-called bulge chasing procedure that is an essential phase of a scalable bidiagonalization procedure. In essence, we coalesce multiple tasks in a way that reduces the time needed to switch execution context between the scheduler and useful computational tasks. At the same time, we maintain the crucial information about the tasks and their data dependencies between the coalescing groups. This is the necessary condition to preserve numerical correctness of the computation. We show our annihilation strategy based on multiple applications of single orthogonal reflectors. Despite non-trivial characteristics in computational complexity and memory access patterns, our optimization approach smoothly applies to the annihilation scenario. The coalescing positively influences another equally important aspect of the bulge chasing stage: the memory reuse. For the tasks within the coalescing groups, the data is retained in high levels of the cache hierarchy and, as a consequence, operations that are normally memory-bound increase their ratio of computation to off-chip communication and become compute-bound which renders them amenable to efficient execution on multicore architectures. The performance for the new two-stage bidiagonal reduction is staggering. Our implementation results in up to 50-fold and 12-fold improvement (~130 Gflop/s) compared to the equivalent routines from LAPACK V3.2 and Intel MKL V10.3, respectively, on an eight socket hexa-core AMD Opteron multicore shared-memory system with a matrix size of 24000 × 24000. 
Last but not least, we provide a comprehensive study on the impact of the coalescing group size in terms of cache utilization and power consumption in the context of this new two-stage bidiagonal reduction.
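The central idea of the abstract, merging fine-grained tasks into coalescing groups while retaining only the inter-group data dependencies, can be illustrated with a minimal sketch. This is not the authors' implementation; the task list, the chain-shaped dependency graph, and the fixed `group_size` are simplifying assumptions made purely for illustration:

```python
# Minimal sketch of task coalescing: fine-grained tasks in a
# dependency DAG are merged into groups of `group_size` tasks.
# Dependencies WITHIN a group are satisfied by the group's internal
# sequential ordering; dependencies BETWEEN groups are kept so the
# scheduler still enforces numerical correctness, as the abstract
# requires.
from collections import defaultdict

def coalesce(tasks, deps, group_size):
    """tasks: task ids in a valid topological order.
    deps: dict mapping a task to its set of prerequisite tasks.
    Returns (groups, group_deps): the coalesced groups and the
    surviving cross-group dependency edges."""
    group_of = {}
    groups = []
    for i, t in enumerate(tasks):
        g = i // group_size
        group_of[t] = g
        if g == len(groups):
            groups.append([])
        groups[g].append(t)
    # Keep only cross-group edges; intra-group edges are implied
    # by executing each group's tasks in order, which is what cuts
    # the scheduler context-switch overhead.
    group_deps = defaultdict(set)
    for t, prereqs in deps.items():
        for p in prereqs:
            if group_of[p] != group_of[t]:
                group_deps[group_of[t]].add(group_of[p])
    return groups, dict(group_deps)

# Toy example: six bulge-chasing-like tasks forming a chain.
tasks = [0, 1, 2, 3, 4, 5]
deps = {i: {i - 1} for i in range(1, 6)}
groups, group_deps = coalesce(tasks, deps, group_size=3)
print(groups)      # [[0, 1, 2], [3, 4, 5]]
print(group_deps)  # {1: {0}}
```

With a chain of six tasks and groups of three, five fine-grained dependency edges collapse to a single inter-group edge, so the scheduler is invoked per group rather than per task; the paper's study of the group-size trade-off (cache utilization vs. power) corresponds to tuning `group_size` here.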
