
Auto-tuned nested parallelism: A way to reduce the execution time of scientific software in NUMA systems


Abstract

The most computationally demanding scientific problems are solved with large parallel systems. In some cases these systems are Non-Uniform Memory Access (NUMA) multiprocessors made up of a large number of cores which share a hierarchically organized memory. The main basic component of these scientific codes is often matrix multiplication, and the efficient development of other linear algebra packages is directly based on the matrix multiplication routine implemented in the BLAS library. The BLAS library is used in the form of vendor-supplied packages or free implementations. The latest versions of this library are multithreaded and can be used efficiently in multicore systems, but when they are called from within parallel codes, the two levels of parallelism can interfere and degrade performance. In this work, an auto-tuning method is proposed to automatically select the optimum number of threads to use at each parallel level when multithreaded linear algebra routines are called from OpenMP parallel codes. The method is based on a simple but effective theoretical model of the execution time of the two-level routines. The methodology is applied to a two-level matrix-matrix multiplication and to different blocked matrix factorizations (LU, QR and Cholesky). Traditional schemes which directly use the multithreaded BLAS routine, dgemm, are compared with schemes combining the multithreaded dgemm with OpenMP. (C) 2014 Elsevier B.V. All rights reserved.
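
The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' execution-time model, so `toy_time_model` and its parameters (`flop_time`, `fork_cost`, `sharing_penalty`) are hypothetical placeholders, as are the function names. The sketch only shows the general idea of enumerating thread counts for the outer (OpenMP) and inner (multithreaded BLAS) levels and keeping the pair with the lowest modeled time.

```python
from typing import Callable, Iterator, Tuple


def candidate_thread_pairs(cores: int) -> Iterator[Tuple[int, int]]:
    """Yield (omp_threads, blas_threads) pairs whose product fits in `cores`."""
    for p in range(1, cores + 1):
        for q in range(1, cores // p + 1):
            yield p, q


def toy_time_model(n: int, p: int, q: int,
                   flop_time: float = 1e-9,
                   fork_cost: float = 1e-5,
                   sharing_penalty: float = 0.05) -> float:
    """Hypothetical time model for an n x n matrix multiplication.

    This is NOT the model from the paper. It assumes ideal speed-up over
    p * q threads, an invented efficiency loss that grows with the number
    of inner BLAS threads q, and a fixed cost per thread created.
    """
    compute = 2.0 * n ** 3 * flop_time / (p * q)
    compute *= 1.0 + sharing_penalty * (q - 1)  # assumed inner-level interference
    overhead = fork_cost * (p + p * q)          # assumed thread creation cost
    return compute + overhead


def select_threads(n: int, cores: int,
                   model: Callable[[int, int, int], float] = toy_time_model
                   ) -> Tuple[int, int]:
    """Return the (omp_threads, blas_threads) pair with the lowest modeled time."""
    return min(candidate_thread_pairs(cores), key=lambda pq: model(n, *pq))


if __name__ == "__main__":
    p, q = select_threads(n=2000, cores=8)
    print(f"outer OpenMP threads: {p}, inner BLAS threads: {q}")
```

In a real setting the `model` argument would be replaced by a model fitted to measurements on the target NUMA machine, and the chosen pair would then be applied through the OpenMP runtime and the threading controls of the BLAS implementation in use.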

Bibliographic details

  • Source
    Parallel Computing | 2014, issue 7 | pp. 309-327 | 19 pages
  • Author affiliations

    Univ Murcia, Fac Informat, Dept Informat & Sistemas, E-30100 Murcia, Spain;

    Univ Murcia, Fac Informat, Dept Ingn Tecnol & Comp, E-30100 Murcia, Spain;

    Univ Politecn Cartagena, Serv Apoyo Invest Tecnol, Cartagena 30203, Spain;

    Univ Murcia, Fac Informat, Dept Informat & Sistemas, E-30100 Murcia, Spain;

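  • Indexed in: Science Citation Index (SCI), USA; Engineering Index (EI), USA
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification
  • Keywords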

    Auto-tuning; Linear algebra; Performance modeling; NUMA;

