Journal: Applied Numerical Mathematics

Explicit parallel block Cholesky algorithms on the CRAY APP


Abstract

In this paper we consider the CRAY APP, the Attached Parallel Processor of the CRAY S-MP, which consists of seven buses, each supporting up to 12 processing elements. Processing elements on different buses can communicate with the shared main memory simultaneously, but processing elements on the same bus cannot, since only one processing element per bus can access memory at a given time. Applications that have a high level of data reuse (or a high computational intensity) and are highly parallel are well suited to the APP. Matrix-matrix multiplication is an example of such an algorithm. We illustrate how this restriction on data traffic influences performance, and we discuss a performance model of the bus architecture that accounts for changes in processor speed, data traffic speed, and cache contents. Furthermore, we discuss two algorithms for Cholesky factorization: a block left-looking algorithm and a block right-looking algorithm. The maximum achievable speed on the CRAY APP is determined mainly by the performance of matrix-matrix multiplication. Parallelism is applied explicitly over the blocks, which makes it possible to concatenate different block operations in cache. The results obtained on CWI's APP (a machine with twenty-eight processing elements) indicate how block algorithms can be parallelized on machines with hundreds or thousands of processors.
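The abstract names the two block variants without giving their loop structure. The sketch below shows the textbook serial forms of the block right-looking and block left-looking Cholesky factorization, not the parallel implementation from the paper. The function names, the list-of-lists matrix layout, and the block size parameter `nb` are assumptions made for illustration. In both variants most of the arithmetic falls in the rank update, which is a matrix-matrix multiplication, consistent with the abstract's remark that this kernel determines the achievable speed.

```python
import math

def _factor_diag(a, k, kb):
    """Unblocked Cholesky of the diagonal block a[k:kb, k:kb], in place."""
    for j in range(k, kb):
        d = a[j][j] - sum(a[j][p] ** 2 for p in range(k, j))
        a[j][j] = math.sqrt(d)
        for i in range(j + 1, kb):
            s = sum(a[i][p] * a[j][p] for p in range(k, j))
            a[i][j] = (a[i][j] - s) / a[j][j]

def _solve_panel(a, k, kb, n):
    """Triangular solve: overwrite a[kb:n, k:kb] with a[kb:n, k:kb] * L_kk^{-T}."""
    for i in range(kb, n):
        for j in range(k, kb):
            s = sum(a[i][p] * a[j][p] for p in range(k, j))
            a[i][j] = (a[i][j] - s) / a[j][j]

def _rank_update(a, rows, cols, p0, p1):
    """Matrix-multiply update of the lower triangle:
    a[i][j] -= sum over p in [p0, p1) of a[i][p] * a[j][p]."""
    for i in rows:
        for j in cols:
            if j <= i:
                a[i][j] -= sum(a[i][p] * a[j][p] for p in range(p0, p1))

def _lower(a):
    """Return the lower triangle of a, with zeros above the diagonal."""
    return [[a[i][j] if j <= i else 0.0 for j in range(len(a))]
            for i in range(len(a))]

def cholesky_right_looking(a, nb):
    """Factor a block column, then immediately update everything to its right."""
    a = [row[:] for row in a]  # work on a copy
    n = len(a)
    for k in range(0, n, nb):
        kb = min(k + nb, n)
        _factor_diag(a, k, kb)
        _solve_panel(a, k, kb, n)
        _rank_update(a, range(kb, n), range(kb, n), k, kb)  # trailing submatrix
    return _lower(a)

def cholesky_left_looking(a, nb):
    """Update the current block column with all columns to its left, then factor it."""
    a = [row[:] for row in a]  # work on a copy
    n = len(a)
    for k in range(0, n, nb):
        kb = min(k + nb, n)
        _rank_update(a, range(k, n), range(k, kb), 0, k)  # current block column only
        _factor_diag(a, k, kb)
        _solve_panel(a, k, kb, n)
    return _lower(a)
```

Both functions return the same lower-triangular factor L with A = L L^T. They differ only in when the updates are applied: the right-looking form writes to the whole trailing submatrix at every step, while the left-looking form reads the already finished columns and writes only to the current block column.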
