Performance optimization of memory-bound programs on data parallel accelerators.

Abstract

High-performance applications depend on high utilization of both memory bandwidth and computing resources, and data parallel accelerators have proven very effective at providing both when needed. Memory-bound programs, however, push the limits of system bandwidth, leaving computing resources under-utilized and making execution energy-inefficient. The objective of this research is to investigate opportunities on data parallel accelerators (i.e., SIMD units and GPUs) and to design solutions that improve the performance of three classes of memory-bound applications: stencil computation, sparse matrix-vector multiplication (SpMV), and graph analytics.

The research first focuses on the performance bottlenecks of stencil computations on short-vector SIMD ISAs and presents StVEC, a hardware-based solution that extends the vector ISA to improve data movement and bandwidth utilization. StVEC includes an extension to the standard addressing mode of vector floating-point instructions in contemporary vector ISAs (e.g., SSE, AVX, VMX). A code generation approach is designed and implemented to help a vectorizing compiler generate code for processors with StVEC extensions. Using both an optimistic and a pessimistic emulation of the proposed StVEC instructions, the solution is shown to be effective on top of SSE- and AVX-capable processors. To analyze hardware overhead, parts of the proposed design are synthesized using a 45nm CMOS library and shown to have minimal impact on processor cycle time.

For the second class of memory-bound programs, the research focuses on sparse matrix-vector multiplication (SpMV) on GPUs and shows that no sparse matrix representation is consistently superior: the best representation depends on the sparsity pattern of the matrix. This part considers four standard sparse representations (i.e., CSR, ELL, COO, and a hybrid ELL-COO) and studies the correlations between SpMV performance and sparsity features. The research then uses machine learning techniques to automatically select the best sparse representation for a given matrix. Pertinent sparsity features are characterized extensively on around 700 sparse matrices, together with their SpMV performance under the different sparse representations. Learning on this dataset yields a decision model that automatically selects the best representation for a given sparse matrix on a given target GPU. Experimental results on three GPUs demonstrate that the approach is very effective in selecting the best representation.

The last part is dedicated to characterizing the performance of graph processing systems on GPUs. It focuses on a vertex-centric graph programming framework (Virtual Warp Centric, VWC) and characterizes the performance bottlenecks that arise when running different graph primitives. The analysis shows how sensitive the VWC parameter is to the input graph and underscores the importance of selecting the correct warp size in order to avoid performance penalties. The study also applies machine learning techniques to the input dataset in order to predict the best VWC configuration for a given graph. It shows that simple machine learning models can improve performance and reduce the auto-tuning time for graph algorithms on GPUs.
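
As a rough illustration of why the best sparse representation depends on the sparsity pattern, the sketch below computes SpMV in CSR and in ELL form on a small matrix and derives one simple row-length feature. It is not code from the dissertation: the function names, the plain-Python CPU implementation, and the `ell_fill_ratio` feature are assumptions made for illustration, and the abstract does not state which features or learning model the author actually used.

```python
# Illustrative sketch only; not the dissertation's implementation.
# All function names and the chosen feature are hypothetical.

def dense_to_csr(dense):
    """Convert a dense matrix (list of rows) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # end offset of this row
    return values, col_idx, row_ptr


def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A x with A stored in CSR form."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def dense_to_ell(dense):
    """Convert to ELL: every row is padded to the longest row's length."""
    rows = [[(j, v) for j, v in enumerate(row) if v != 0] for row in dense]
    width = max((len(r) for r in rows), default=0)
    # Padding entries use column 0 with value 0.0, so they add nothing.
    ell_cols = [[j for j, _ in r] + [0] * (width - len(r)) for r in rows]
    ell_vals = [[v for _, v in r] + [0.0] * (width - len(r)) for r in rows]
    return ell_vals, ell_cols


def spmv_ell(ell_vals, ell_cols, x):
    """Compute y = A x with A stored in ELL form."""
    return [
        sum(v * x[j] for v, j in zip(val_row, col_row))
        for val_row, col_row in zip(ell_vals, ell_cols)
    ]


def ell_fill_ratio(row_ptr):
    """Fraction of ELL slots holding real nonzeros (1.0 means no padding)."""
    lengths = [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]
    slots = max(lengths, default=0) * len(lengths)
    return sum(lengths) / slots if slots else 1.0


A = [
    [4, 0, 0, 0],
    [0, 5, 1, 0],
    [2, 0, 3, 6],
    [0, 0, 0, 7],
]
x = [1, 2, 3, 4]

values, col_idx, row_ptr = dense_to_csr(A)
ell_vals, ell_cols = dense_to_ell(A)

y_csr = spmv_csr(values, col_idx, row_ptr, x)
y_ell = spmv_ell(ell_vals, ell_cols, x)
fill = ell_fill_ratio(row_ptr)
```

Both formats produce the same product, but ELL stores and processes padded slots: in this matrix only 7 of its 12 slots hold real nonzeros. A matrix with uniform row lengths has a fill ratio near 1 and suits ELL well, while a matrix with a few very long rows wastes most of the ELL storage and may favor CSR, COO, or a hybrid. Features of this kind are the sort of input a format-selection model could plausibly use.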


Bibliographic record

Author: Sedaghati Mokhtari, Naseraddin
Author affiliation: The Ohio State University
Degree-granting institution: The Ohio State University
Subject: Engineering; Computer science; Computer engineering
Degree: Ph.D.
Year: 2016
Pages: 166 p.
Original format: PDF
Language: English

