Performance optimization of memory-bound programs on data parallel accelerators.


Abstract

High-performance applications depend on high utilization of memory bandwidth and computing resources, and data parallel accelerators have proven very effective at providing both when needed. However, memory-bound programs push the limits of system bandwidth, leaving computing resources under-utilized and making execution energy-inefficient. The objective of this research is to investigate opportunities on data parallel accelerators (i.e., SIMD units and GPUs) and to design solutions for improving the performance of three classes of memory-bound applications: stencil computation, sparse matrix-vector multiplication (SpMV), and graph analytics.

This research first focuses on performance bottlenecks of stencil computations on short-vector SIMD ISAs and presents StVEC, a hardware-based solution that extends the vector ISA to improve data movement and bandwidth utilization. StVEC includes an extension to the standard addressing mode of vector floating-point instructions in contemporary vector ISAs (e.g., SSE, AVX, VMX). A code generation approach is designed and implemented to help a vectorizing compiler generate code for processors with StVEC extensions. Using both an optimistic and a pessimistic emulation of the proposed StVEC instructions, it is shown that the proposed solution can be effective on top of SSE- and AVX-capable processors. To analyze hardware overhead, parts of the proposed design are synthesized using a 45nm CMOS library and shown to have minimal impact on processor cycle time.

For the second class of memory-bound programs, this research focuses on SpMV on GPUs and shows that no sparse matrix representation is consistently superior: the best representation depends on the sparsity pattern of the matrix. This part focuses on four standard sparse representations (CSR, ELL, COO, and a hybrid ELL-COO) and studies the correlations between SpMV performance and sparsity features. The research then uses machine learning techniques to automatically select the best sparse representation for a given matrix. Pertinent sparsity features are characterized extensively on around 700 sparse matrices, together with their SpMV performance under the different sparse representations. Applying learning to this dataset yields a decision model that automatically selects the best representation for a given sparse matrix on a given target GPU. Experimental results on three GPUs demonstrate that the approach is very effective in selecting the best representation.

The last part is dedicated to characterizing the performance of graph processing systems on GPUs. It focuses on a vertex-centric graph programming framework (Virtual Warp Centric, VWC) and characterizes performance bottlenecks when running different graph primitives. The analysis shows how sensitive the VWC parameter is to the input graph and underscores the importance of selecting the correct warp size in order to avoid performance penalties. The study also applies machine learning techniques to the input dataset in order to predict the best VWC configuration for a given graph. It shows that simple machine learning models can improve performance and reduce the auto-tuning time for graph algorithms on GPUs.
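To make the SpMV part of the abstract more concrete, the following is a minimal Python sketch of two of the representations it names, CSR and ELL, together with one illustrative sparsity feature. It is not taken from the thesis: the function names (`dense_to_csr`, `spmv_csr`, `dense_to_ell`, `spmv_ell`, `ell_padding_ratio`) and the choice of padding ratio as a feature are assumptions made for illustration, and the code runs on the CPU rather than a GPU.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def dense_to_csr(dense: Matrix) -> Tuple[List[float], List[int], List[int]]:
    """Build CSR arrays: nonzero values, their column indices, and row offsets."""
    values: List[float] = []
    col_idx: List[int] = []
    row_ptr: List[int] = [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # end offset of this row
    return values, col_idx, row_ptr


def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for A in CSR form; each row walks only its own nonzeros."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def dense_to_ell(dense: Matrix) -> Tuple[Matrix, List[List[int]], int]:
    """Build ELL arrays: every row is padded to the longest row's nonzero count."""
    width = max((sum(1 for v in row if v != 0) for row in dense), default=0)
    vals, cols = [], []
    for row in dense:
        nz = [(j, v) for j, v in enumerate(row) if v != 0]
        pad = width - len(nz)
        cols.append([j for j, _ in nz] + [0] * pad)    # padded column index 0
        vals.append([v for _, v in nz] + [0.0] * pad)  # padded value 0.0
    return vals, cols, width


def spmv_ell(vals, cols, x):
    """y = A @ x for A in ELL form; padded entries contribute zero."""
    return [sum(v * x[j] for v, j in zip(rv, rc)) for rv, rc in zip(vals, cols)]


def ell_padding_ratio(dense: Matrix) -> float:
    """Illustrative sparsity feature: fraction of ELL storage that is padding."""
    _, _, width = dense_to_ell(dense)
    if width == 0:
        return 0.0
    nnz = sum(1 for row in dense for v in row if v != 0)
    return 1.0 - nnz / (width * len(dense))
```

For a matrix whose rows have very uneven nonzero counts, `ell_padding_ratio` approaches 1 and ELL wastes most of its storage and bandwidth on padding, while CSR stores only the nonzeros. This is one simple way to see why the best representation can depend on the sparsity pattern.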

Bibliographic record

  • Author affiliation: The Ohio State University
  • Degree-granting institution: The Ohio State University
  • Subjects: Engineering; Computer science; Computer engineering
  • Degree: Ph.D.
  • Year: 2016
  • Pages: 166
  • Original format: PDF
  • Language: English
