A journey through performance evaluation, tuning, and analysis of parallelized applications and parallel architectures: Quantitative approach.

Abstract

In today's multicore era, with fabrication technology steadily improving, the new challenge is to find applications (i.e., killer apps) that exploit the increased computational power. Automatic parallelization of sequential programs combined with tuning techniques is an alternative to manual parallelization that saves programmer time and effort. Hand parallelization is a tedious, error-prone process. A key difficulty is that parallelizing compilers are generally unable to estimate the performance impact of an optimization on a whole program or a program section at compile time; hence, the ultimate performance decision today rests with the developer. Building an autotuning system to remedy this situation is not a trivial task. Automatic parallelization concentrates on finding any possible parallelism in the program, whereas tuning systems help identify efficient parallel code segments and profitable optimization techniques. A key limitation of advanced optimizing compilers is their lack of runtime information, such as the program input data.

With the renewed relevance of autoparallelizers, a comprehensive evaluation will identify strengths and weaknesses in the underlying techniques and direct researchers as well as engineers to potential improvements. No comprehensive study has been conducted on modern parallelizing compilers for today's multicore systems. Such a study needs to evaluate different levels of techniques and their interactions, which requires efficiently navigating a large search space of optimization variants. With the recent arrival of non-trivial parallel architectures, programmers need to learn how these systems behave with respect to their programs in order to orchestrate them and make maximum use of the vast number of CPU cycles available.

In this dissertation, we take a journey through parallel applications and parallel architectures using a quantitative approach. This work presents a portable empirical autotuning system that operates at program-section granularity and partitions the compiler options into groups that can be tuned independently. To our knowledge, this is the first approach delivering an autoparallelization system that ensures performance improvements for nearly all programs, eliminating the users' need to "experiment" with such tools in pursuit of the highest application performance. This method has the potential to substantially increase productivity and is thus of critical importance for exploiting the increased computational power of today's multicores.

We present an experimental methodology for comprehensively evaluating the effectiveness of parallelizing compilers and their underlying optimization techniques. The methodology takes advantage of the proposed customizable tuning system, which can efficiently evaluate a large space of optimization variants. We applied the proposed methodology to five modern parallelizing compilers and their tuning capabilities; we reported speedups, parallel coverage, and the number of parallel loops, using the NAS Benchmarks as a program suite. As there is an extensive body of proposed compiler analyses and transformations for parallelization, the question of the relative importance of the techniques arises. This work evaluates the impact of the individual optimization techniques on overall program performance and discusses their mutual interactions. We study the differences between polyhedral-model-based compilers and abstract-syntax-tree-based compilers. We also study the scalability of the IBM BlueGeneQ and Intel MIC architectures as representatives of modern multicore systems.

We found parallelizers to be reasonably successful in about half of the given science and engineering programs. Advanced versions of some of the techniques identified as most successful in previous generations of compilers are also most important today, while other techniques have risen significantly in impact. An important finding is also that some techniques substitute for each other. Furthermore, we found that automatic tuning can lead to significant additional performance and sometimes matches or outperforms hand-parallelized programs. We analyze specific reasons for the measured performance and the potential for improvement of automatic parallelization. On average over all programs, the BlueGeneQ and MIC systems achieved a scalability factor of 1.5.

Bibliographic details

Author: Mustafa, Dheya G.
Author affiliation: Purdue University
Degree-granting institution: Purdue University
Subject: Engineering Computer; Computer Science
Degree: Ph.D.
Year: 2013
Pages: 137 p.
Original format: PDF
Language: English

Similar literature

1. Performance Analysis of Homogeneous On-Chip Large-Scale Parallel Computing Architectures for Data-Parallel Applications [J]. Xiaowen Chen, Zhonghai Lu, Axel Jantsch. Journal of Electrical and Computer Engineering, 2015.
2. Performance Analysis of Homogeneous On-Chip Large-Scale Parallel Computing Architectures for Data-Parallel Applications [J]. Xiaowen Chen, Zhonghai Lu, Axel Jantsch. Journal of Electrical and Computer Engineering, 2015, no. 1.
3. Performance assessment of parallel spectral analysis: towards a practical performance model for parallel medical applications [J]. F. Munz, T. Ludwig, S. Ziegler. Future Generation Computer Systems, 2000, no. 5.
4. Performance Analysis and Tuning of Automatically Parallelized OpenMP Applications [C]. Dheya Mustafa, Aurangzeb, Rudolf Eigenmann. OpenMP in the Petascale Era, 2011.
5. Parallelization and performance optimization of bioinformatics and biomedical applications targeted to advanced computer architectures [D]. Niu, Yanwei. 2005.
6. Performance of parallel FDTD method for shared- and distributed-memory architectures: Application to bioelectromagnetics [O]. Miguel Ruiz-Cabello N., Maksims Abaļenkovs, Luis M. Diaz Angulo. 2020.
7. Performance Analysis of Homogeneous On-Chip Large-Scale Parallel Computing Architectures for Data-Parallel Applications [O]. Xiaowen Chen, Zhonghai Lu, Axel Jantsch. 2015.
8. Performance Analysis of Multilevel Parallel Applications on Shared Memory Architectures [R]. Jost, G., Jin, H., Labarta, J. 2002.
