NVIDIA GPUs Scalability to Solve Multiple (Batch) Tridiagonal Systems Implementation of cuThomasBatch


Abstract

Solving tridiagonal systems is one of the most computationally expensive parts of many applications, so multiple studies have explored the use of NVIDIA GPUs to accelerate this computation. However, these studies have mainly focused on parallel algorithms, which efficiently exploit shared memory and can saturate the GPU's capacity with a low number of systems, but scale poorly when dealing with a relatively high number of systems. We propose a new implementation (cuThomasBatch) based on the Thomas algorithm. To achieve good scalability with this approach, it is necessary to transform the way the inputs are stored in memory in order to exploit coalescence (contiguous threads accessing contiguous memory locations). The results of this study show that the implementation beats the reference code when dealing with a relatively large number of tridiagonal systems (2,000-256,000), being close to 3× (in double precision) and 4× (in single precision) faster on one NVIDIA Kepler GPU.
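The storage transformation mentioned in the abstract can be illustrated with a small CPU-side sketch in Python. This is not the authors' cuThomasBatch code: the function names `interleave` and `thomas_batch_interleaved` are hypothetical, the sketch assumes all systems have the same size and need no pivoting (for example, diagonally dominant matrices), and the outer loop over `s` only stands in for what one GPU thread per system would do. The point is the indexing: element `i` of system `s` is stored at `i * num_systems + s`, so threads `s` and `s + 1` working on the same row `i` would touch adjacent memory locations.

```python
def interleave(batch):
    """Convert per-system storage batch[s][i] into a flat interleaved list
    where element i of system s sits at index i * num_systems + s."""
    num_systems = len(batch)
    n = len(batch[0])
    out = [0.0] * (num_systems * n)
    for s in range(num_systems):
        for i in range(n):
            out[i * num_systems + s] = batch[s][i]
    return out


def thomas_batch_interleaved(a, b, c, d, num_systems, n):
    """Solve num_systems tridiagonal systems of size n stored interleaved.

    a: sub-diagonal (row 0 unused), b: main diagonal,
    c: super-diagonal (row n-1 unused), d: right-hand side.
    Returns the solutions in the same interleaved layout.
    Assumption: no pivoting is required.
    """
    c_prime = list(c)  # modified super-diagonal
    x = list(d)        # holds the modified right-hand side, then the solution
    # Each iteration of this loop corresponds to the work of one GPU thread.
    for s in range(num_systems):
        # Forward elimination.
        c_prime[s] = c[s] / b[s]
        x[s] = d[s] / b[s]
        for i in range(1, n):
            idx = i * num_systems + s
            prev = idx - num_systems
            denom = b[idx] - a[idx] * c_prime[prev]
            c_prime[idx] = c[idx] / denom
            x[idx] = (d[idx] - a[idx] * x[prev]) / denom
        # Back substitution.
        for i in range(n - 2, -1, -1):
            idx = i * num_systems + s
            x[idx] -= c_prime[idx] * x[idx + num_systems]
    return x
```

In a CUDA version of this idea, the loop over `s` would presumably disappear and `s` would be derived from the block and thread indices, while the two inner loops would stay sequential within each thread.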


Bibliographic record

Source
International Conference on Parallel Processing and Applied Mathematics | 2018 | 660p | 11 pages
Authors
Pedro Valero-Lara; Ivan Martinez-Perez; Raul Sirvent; Xavier Martorell; Antonio J. Pena;
Original format: PDF
Chinese Library Classification: TP316.4-53
Keywords
Tridiagonal linear systems; Scalability; Thomas algorithm; PCR; CR; Parallel processing; cuSPARSE; CUDA


Similar literature


1. cuThomasBatch and cuThomasVBatch, CUDA Routines to compute batch of tridiagonal systems on NVIDIA GPUs [J] . Pedro Valero-Lara, Ivan Martinez-Perez, Rauel Sirvent, Concurrency and Computation . 2018, No. 24

2. Implementing and Evaluating an Heterogeneous, Scalable, Tridiagonal Linear System Solver with OpenCL to Target FPGAs, GPUs, and CPUs [J] . Macintosh Hamish J., Banks Jasmine E., Kelson Neil A. International journal of reconfigurable computing . 2019, No. PTa1

3. Implementing and Evaluating an Heterogeneous, Scalable, Tridiagonal Linear System Solver with OpenCL to Target FPGAs, GPUs, and CPUs [J] . Hamish J. Macintosh, Jasmine E. Banks, Neil A. Kelson International journal of reconfigurable computing . 2019, No. 5aaPagea2

4. NVIDIA GPUs Scalability to Solve Multiple (Batch) Tridiagonal Systems Implementation of cuThomasBatch [C] . Pedro Valero-Lara, Ivan Martinez-Perez, Raul Sirvent, International conference on parallel processing and applied mathematics . 2018

5. More effective use of high performance systems using sub-batch allocation resource management within multiple component multiple data applications. [D] . Foley, Samantha S. 2010

6. LASSIE: simulating large-scale models of biochemical systems on GPUs [O] . Andrea Tangherloni, Marco S. Nobile, Daniela Besozzi, 2017

7. cuThomasBatch and cuThomasVBatch, CUDA Routines to compute batch of tridiagonal systems on NVIDIA GPUs [O] . Pedro Valero-Lara, Ivan Martínez-Pérez, Raül Sirvent, 2018
