首页> 外文学位 >Metascalable hybrid message-passing and multithreading algorithms for n-tuple computation.
【24h】

Metascalable hybrid message-passing and multithreading algorithms for n-tuple computation.

机译:用于n元组计算的metascalable混合消息传递和多线程算法。

获取原文
获取原文并翻译 | 示例

摘要

This dissertation research develops a scalable hybrid message-passing and multithreading algorithm for n-tuple molecular dynamics (MD) simulation, which will continue to scale on future architectures ( i.e. achieving metascalability). The two major goals of this dissertation research are: (1) design a scalable hybrid message-passing and multithreading parallel algorithmic framework on multicore architectures and evaluate it on most advanced parallel architectures; and (2) develop a computation-pattern algebraic framework to design scalable algorithms for general n-tuple computation and prove its optimality in a systematic and mathematically rigorous manner.;To achieve the first goal, we have developed and thoroughly analyzed algorithms for hybrid message passing interface (MPI) + open multiprocessing (OpenMP) parallelization of n-tuple MD simulation, which are scalable on large multicore clusters. Two data-privatization thread scheduling algorithms via nucleation-growth allocation have been designed: (1) compact- volume allocation scheduling (CVAS); and (2) breadth-first allocation scheduling (BFAS). These two algorithms combine fine-grain dynamic load balancing and minimal memory- footprint threading. Theoretical study has revealed decent asymptotic memory efficiency for both algorithms, thereby reducing 75% memory consumption compared to a naive-threading algorithm. Furthermore, performance benchmarks have confirmed higher performance of the hybrid MD algorithm over a traditional algorithm on large multicore clusters, where 2.58-fold speedup of the hybrid algorithm over the traditional algorithm was observed on 32,768 nodes of IBM BlueGene/P.;To achieve the second goal, we have developed a computation-pattern algebraic framework to mathematically formulate general n-tuple computation. Based on translation/reflection-invariant properties of computation patterns within this framework, we have designed a shift-collapse (SC) algorithm for cell-based parallel MD. Theoretical analysis has quantified the compact n-tuple search space and small communication cost of SC-MD for arbitrary n, which are reduced to those in best pair-computation approaches (e.g. eighth-shell method) for n = 2. Benchmark tests have shown that SC-MD outperforms our production MD code at the finest grain, with 9.7- and 5.1-fold speedups on Intel-Xeon and BlueGene/Q clusters. SC-MD has also exhibited excellent strong scalability.;In addition, we have analyzed the computational and data-access patterns of MD, which led to the development of a performance prediction model for short-range pair-wise force computations in MD simulations. The analysis and performance model provide fundamental understanding of computation patterns and optimality of certain parameters in MD simulations, thus allowing scientists to determine the optimal cell dimension in a linked-list cell method. The model has accurately estimated the number of operations during the simulations with the maximum error of 10.6% compared to actual measurements. Analysis and benchmark of the model have revealed that the optimal cell dimension minimizing the computation time is determined by a trade-off between decreasing search space and increasing linked-list cell access for smaller cells.;One difficulty about MD is that it is a dynamic irregular application, which often suffers considerable performance deterioration during execution. To address this problem, an optimal data-reordering schedule has been developed for runtime memory-access optimization of MD simulations on parallel computers. Analysis of the memory-access penalty during MD simulations has shown that the performance improvement from computation and data reordering degrades gradually as data translation lookaside buffer misses increase. We have also found correlations between the performance degradation with physical properties such as the simulated temperature, as well as with computational parameters such as the spatial-decomposition granularity. (Abstract shortened by UMI.).
机译:本论文的研究为n元组分子动力学(MD)仿真开发了一种可扩展的混合消息传递和多线程算法,该算法将继续在未来的体系结构上进行扩展(即实现元可伸缩性)。本论文研究的两个主要目标是:(1)在多核体系结构上设计一种可扩展的混合消息传递和多线程并行算法框架,并在最先进的并行体系结构上进行评估。 (2)开发一种计算模式代数框架,以设计用于通用n元组计算的可扩展算法,并以系统和严格的数学方式证明其最优性。;为了实现第一个目标,我们开发并彻底分析了混合消息算法n元组MD仿真的传递接口(MPI)+开放式多处理(OpenMP)并行化,可在大型多核群集上进行扩展。设计了两种通过成核-增长分配的数据私有化线程调度算法:(1)紧凑型卷分配调度(CVAS); (2)广度优先分配调度(BFAS)。这两种算法结合了细粒度的动态负载平衡和最小的内存占用线程。理论研究表明,两种算法的渐近存储效率都很高,因此与朴素线程算法相比,可减少75%的内存消耗。此外,性能基准已经证实,在大型多核集群上,混合MD算法的性能比传统算法高,在IBM BlueGene / P的32,768个节点上,混合算法的速度比传统算法提高了2.58倍;第二个目标,我们开发了一种计算模式代数框架,以数学方式表示一般的n元组计算。基于此框架内计算模式的平移/反射不变特性,我们针对基于单元格的并行MD设计了移位折叠(SC)算法。理论分析已量化了任意n的紧凑n元组搜索空间和SC-MD的较小通信成本,在n = 2的情况下,这些成本已降低到最佳配对计算方法(例如,八层壳法)中的成本。 SC-MD的性能要优于我们的生产MD代码,在Intel-Xeon和BlueGene / Q群集上的速度分别提高了9.7和5.1倍。 SC-MD还具有出色的强大可伸缩性。此外,我们分析了MD的计算和数据访问模式,从而导致了MD仿真中用于短程成对力计算的性能预测模型的发展。分析和性能模型提供了对MD模拟中的计算模式和某些参数的最优性的基本了解,从而使科学家可以通过链表单元法确定最佳的单元尺寸。该模型已准确估计了仿真期间的操作次数,与实际测量值相比,最大误差为10.6%。该模型的分析和基准测试表明,最小化计算时间的最佳像元尺寸是由减小搜索空间和增加对较小像元的链表像元访问之间的权衡决定的。MD的一个难题是它是动态的不规则应用程序,在执行过程中通常会遭受相当大的性能下降。为了解决这个问题,已经为并行计算机上的MD模拟的运行时内存访问优化开发了最佳的数据重新排序时间表。在MD仿真过程中对内存访问损失的分析表明,随着数据转换后备缓冲区缺失的增加,计算和数据重新排序的性能改进会逐渐降低。我们还发现性能下降与物理属性(如模拟温度)以及计算参数(如空间分解粒度)之间存在相关性。 (摘要由UMI缩短。)。

著录项

  • 作者

    Kunaseth, Manaschai.;

  • 作者单位

    University of Southern California.;

  • 授予单位 University of Southern California.;
  • 学科 Computer Science.
  • 学位 Ph.D.
  • 年度 2013
  • 页码 203 p.
  • 总页数 203
  • 原文格式 PDF
  • 正文语种 eng
  • 中图分类
  • 关键词

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号