We address some key issues in designing dense linear algebra (DLA) algorithms that are common to both multi/many-core and special-purpose architectures (in particular GPUs). We present them in the context of an LU factorization algorithm in which randomization techniques are used as an alternative to pivoting. This approach yields an algorithm based entirely on a collection of small Level 3 BLAS-type computational tasks, which has emerged as a common goal in designing DLA algorithms for new architectures. Other common trends, also considered here, are block asynchronous task execution and "block" layouts for the data associated with the separate tasks. We present numerical results and other specific experiments with DLA algorithms on NVIDIA GPUs using CUDA. The GPU results are also of interest in themselves, as we show a performance of up to 160 Gflop/s on a single Quadro FX 5600 card.
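To illustrate the structure the abstract refers to, the following is a minimal sketch of a blocked, right-looking LU factorization without pivoting: once pivoting is removed (e.g., after the input has been preconditioned by a randomization step, which is omitted here), the bulk of the work becomes the trailing-matrix update, a Level 3 BLAS (GEMM-like) block task. This is an illustrative NumPy sketch, not the paper's implementation; the function name and block size are our own choices, and the test uses a diagonally dominant matrix so that no pivoting is needed.

```python
import numpy as np

def lu_nopivot_blocked(A, nb=64):
    """Blocked right-looking LU without pivoting (illustrative sketch).

    The trailing update in step 3 is a matrix-matrix (Level 3 BLAS)
    operation on independent blocks -- the kind of small GEMM task the
    abstract describes as the target of the algorithm design.
    """
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # 1. Unblocked LU of the panel A[k:n, k:e], no row interchanges.
        for j in range(k, e):
            A[j + 1:, j] /= A[j, j]
            A[j + 1:, j + 1:e] -= np.outer(A[j + 1:, j], A[j, j + 1:e])
        # 2. Triangular solve for the U block row: L11 \ A[k:e, e:n].
        L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
        A[k:e, e:] = np.linalg.solve(L11, A[k:e, e:])
        # 3. Level 3 BLAS trailing update (GEMM): the dominant cost.
        A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return L, U
```

In the task-based formulation sketched by the abstract, step 3 would be split into independent tile-sized GEMM tasks (each operating on its own "block"-laid-out data) that can be scheduled asynchronously on GPU or multicore resources.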