Journal of Parallel and Distributed Computing

Cross-scale efficient tensor contractions for coupled cluster computations through multiple programming model backends

Abstract

Coupled-cluster methods provide highly accurate models of molecular structure through explicit numerical calculation of tensors representing the correlation between electrons. These calculations are dominated by a sequence of tensor contractions, motivating the development of numerical libraries for such operations. While based on matrix-matrix multiplication, these libraries are specialized to exploit symmetries in the molecular structure and in electronic interactions, and thus reduce the size of the tensor representation and the complexity of contractions. The resulting algorithms are irregular, and their parallelization has previously been achieved through dynamic scheduling or specialized data decompositions. We present our work on extending the Libtensor framework to distributed-memory environments in a scalable and energy-efficient manner. We achieve up to 240x speedup compared with the optimized shared-memory implementation of Libtensor. We attain scalability to hundreds of thousands of compute cores on three distributed-memory architectures (Cray XC30 and XC40, and IBM Blue Gene/Q), and on a heterogeneous GPU-CPU system (Cray XK7). Because the bottleneck shifts from compute-bound DGEMMs to communication-bound collectives as the molecular system grows, we adopt two radically different parallelization approaches for handling load imbalance: tasking and bulk-synchronous models. Nevertheless, we preserve a unified interface to both programming models to maintain the productivity of computational quantum chemists.
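To make concrete the idea that symmetry reduces a contraction to a small, irregular set of matrix-matrix products, here is a minimal sketch in plain Python. It is not Libtensor's API or the authors' algorithm: the dictionary-of-blocks layout and the names matmul, add_into, and block_sparse_contract are hypothetical, chosen only for illustration. Blocks that symmetry forces to zero are simply not stored, so they never generate a multiplication.

```python
def matmul(a, b):
    """Dense product of two small matrices stored as lists of rows
    (stands in for a DGEMM call in this sketch)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add_into(acc, m):
    """Accumulate matrix m into acc in place."""
    for i, row in enumerate(m):
        for j, value in enumerate(row):
            acc[i][j] += value


def block_sparse_contract(a_blocks, b_blocks):
    """Compute C[i, j] = sum over k of A[i, k] @ B[k, j].

    Both inputs are dicts mapping (block_row, block_col) to a dense block.
    Only stored (nonzero) blocks are visited, so the number of block
    products depends on the sparsity pattern, not on the full block grid.
    """
    c_blocks = {}
    gemm_calls = 0
    for (i, k), a in a_blocks.items():
        for (k2, j), b in b_blocks.items():
            if k != k2:
                continue  # contracted block indices must match
            product = matmul(a, b)
            gemm_calls += 1
            if (i, j) in c_blocks:
                add_into(c_blocks[(i, j)], product)
            else:
                c_blocks[(i, j)] = product
    return c_blocks, gemm_calls


# Hypothetical 2x2 block grid; the missing blocks are assumed zero.
A = {(0, 0): [[1, 2], [3, 4]],
     (0, 1): [[1, 0], [0, 1]],
     (1, 1): [[2, 0], [0, 2]]}
B = {(0, 0): [[1, 0], [0, 1]],
     (1, 0): [[5, 6], [7, 8]]}

C, gemm_calls = block_sparse_contract(A, B)
print(C, gemm_calls)
```

In this toy case only 3 block products are performed, whereas a dense 2x2x2 block grid would need 8. Each product is an independent unit of work of possibly different size, which hints at why such contractions are irregular and why their parallelization calls for dynamic scheduling or specialized data decompositions.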
