Published in: 2012 IEEE 26th International Parallel and Distributed Processing Symposium

A Comprehensive Study of Task Coalescing for Selecting Parallelism Granularity in a Two-Stage Bidiagonal Reduction



Abstract

We present new high performance numerical kernels combined with advanced optimization techniques that significantly increase the performance of parallel bidiagonal reduction. Our approach is based on developing efficient fine-grained computational tasks as well as reducing overheads associated with their high-level scheduling during the so-called bulge chasing procedure that is an essential phase of a scalable bidiagonalization procedure. In essence, we coalesce multiple tasks in a way that reduces the time needed to switch execution context between the scheduler and useful computational tasks. At the same time, we maintain the crucial information about the tasks and their data dependencies between the coalescing groups. This is the necessary condition to preserve numerical correctness of the computation. We show our annihilation strategy based on multiple applications of single orthogonal reflectors. Despite non-trivial characteristics in computational complexity and memory access patterns, our optimization approach smoothly applies to the annihilation scenario. The coalescing positively influences another equally important aspect of the bulge chasing stage: the memory reuse. For the tasks within the coalescing groups, the data is retained in high levels of the cache hierarchy and, as a consequence, operations that are normally memory-bound increase their ratio of computation to off-chip communication and become compute-bound which renders them amenable to efficient execution on multicore architectures. The performance for the new two-stage bidiagonal reduction is staggering. Our implementation results in up to 50-fold and 12-fold improvement (~130 Gflop/s) compared to the equivalent routines from LAPACK V3.2 and Intel MKL V10.3, respectively, on an eight socket hexa-core AMD Opteron multicore shared-memory system with a matrix size of 24000 × 24000. 
Last but not least, we provide a comprehensive study on the impact of the coalescing group size in terms of cache utilization and power consumption in the context of this new two-stage bidiagonal reduction.
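The central idea of the abstract, merging fine-grained tasks into coalescing groups while retaining only the inter-group data dependencies, can be illustrated with a minimal sketch. This is not the authors' implementation; the task list, the chain-shaped dependency graph, and the fixed `group_size` are simplifying assumptions made purely for illustration:

```python
# Minimal sketch of task coalescing: fine-grained tasks in a
# dependency DAG are merged into groups of `group_size` tasks.
# Dependencies WITHIN a group are satisfied by the group's internal
# sequential ordering; dependencies BETWEEN groups are kept so the
# scheduler still enforces numerical correctness, as the abstract
# requires.
from collections import defaultdict

def coalesce(tasks, deps, group_size):
    """tasks: task ids in a valid topological order.
    deps: dict mapping a task to its set of prerequisite tasks.
    Returns (groups, group_deps): the coalesced groups and the
    surviving cross-group dependency edges."""
    group_of = {}
    groups = []
    for i, t in enumerate(tasks):
        g = i // group_size
        group_of[t] = g
        if g == len(groups):
            groups.append([])
        groups[g].append(t)
    # Keep only cross-group edges; intra-group edges are implied
    # by executing each group's tasks in order, which is what cuts
    # the scheduler context-switch overhead.
    group_deps = defaultdict(set)
    for t, prereqs in deps.items():
        for p in prereqs:
            if group_of[p] != group_of[t]:
                group_deps[group_of[t]].add(group_of[p])
    return groups, dict(group_deps)

# Toy example: six bulge-chasing-like tasks forming a chain.
tasks = [0, 1, 2, 3, 4, 5]
deps = {i: {i - 1} for i in range(1, 6)}
groups, group_deps = coalesce(tasks, deps, group_size=3)
print(groups)      # [[0, 1, 2], [3, 4, 5]]
print(group_deps)  # {1: {0}}
```

With a chain of six tasks and groups of three, five fine-grained dependency edges collapse to a single inter-group edge, so the scheduler is invoked per group rather than per task; the paper's study of the group-size trade-off (cache utilization vs. power) corresponds to tuning `group_size` here.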
