
Runtime Support toward Transparent Memory Access in GPU-accelerated Heterogeneous Systems.


Abstract

The GPU has become a popular parallel accelerator in modern heterogeneous systems thanks to its massive parallelism and superior energy efficiency. However, it also greatly complicates programming the memory system of such heterogeneous systems, because the CPU and GPU have non-contiguous, separate memory spaces and the GPU itself has a two-level memory hierarchy. This complexity is fully exposed to programmers, who must manually move data to the right place in memory for correctness, and must reason about data layout and locality to match the data access patterns of multi-threaded parallel code for performance.

In this Ph.D. thesis study, we approach this problem by providing runtime system support aimed at both easier programming and better performance. Specifically, we present two such runtime software approaches, one within the scope of a programming model and one as general system software.

With the first approach, we focus on two popular programming models, MapReduce and MPI. For GPU-based MapReduce, we provide a transparent GPU memory hierarchy to MapReduce developers and improve performance by buffering data in the GPU's shared memory, a small on-chip scratch-pad memory. On a system powered by an Nvidia GTX 280 GPU, our MapReduce outperforms a previous shared-memory-oblivious MapReduce, with a prominent Map-phase speedup of 2.67x on average. For MPI, we extend the interface so that GPU memory buffers can be used directly in communication, and we optimize such GPU-involved intra-node MPI communication by pipelining CPU-GPU data movement with inter-process communication, and by GPU DMA-assisted data movement.
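The pipelining idea can be sketched in plain Python (an illustrative model, not the dissertation's CUDA/MPI implementation; `d2h_copy` and `ipc_send` stand in for the device-to-host copy and the intra-node send): a large GPU-resident message is split into chunks so that the copy of chunk i+1 overlaps with the send of chunk i.

```python
# Hypothetical sketch of pipelined GPU-to-peer transfer: a copier thread
# stages chunks (simulating device-to-host copies) while the main thread
# sends already-staged chunks (simulating intra-node IPC), so the two
# stages overlap instead of running back to back.
from queue import Queue
from threading import Thread

def pipelined_send(gpu_buffer, chunk_size, d2h_copy, ipc_send):
    """Overlap device-to-host copies with intra-node sends, chunk by chunk."""
    staged = Queue()

    def copier():
        for off in range(0, len(gpu_buffer), chunk_size):
            staged.put(d2h_copy(gpu_buffer[off:off + chunk_size]))
        staged.put(None)  # sentinel: no more chunks

    t = Thread(target=copier)
    t.start()
    while (chunk := staged.get()) is not None:
        ipc_send(chunk)  # overlaps with the copier staging the next chunk
    t.join()
```

With chunk k in flight while chunk k+1 is being copied, the total transfer time approaches the maximum of the two stage times rather than their sum, which is the source of the bandwidth speedup described above.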
Compared to manually interleaving GPU data movement with MPI communication on a multi-core system equipped with three Nvidia Tesla Fermi GPUs, pipelining yields up to a 2x bandwidth speedup and an average 4.3% reduction in the total execution time of a halo-exchange benchmark; our DMA-assisted intra-node data communication further yields up to a 1.4x bandwidth speedup between nearby GPUs and a further 4.7% improvement on the benchmark.

With the second approach, we present the design of Region-based Software Virtual Memory (RSVM), a software virtual memory that runs on both the CPU and the GPU in an asynchronous, cooperative way. Beyond automatic GPU memory management and GPU-CPU data transfer, RSVM offers two novel features: 1) GPU-kernel-issued on-demand fetching of data from the host into GPU memory, and 2) intra-kernel transparent swapping of GPU memory out to main memory. Our study reveals important insights into the challenges and opportunities of building unified virtual memory systems for heterogeneous computing. Experimental results on real GPU benchmarks demonstrate that, although it incurs a small overhead, RSVM can transparently scale GPU kernels to problem sizes that exceed the device memory limit. It allows developers to write the same code for different problem sizes and then to optimize the data layout definition accordingly. Our evaluation also identifies missing GPU architecture features that would improve system software efficiency.
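The two RSVM features can be illustrated with a toy region table (hypothetical names and structure, not RSVM's actual code): regions are fetched from the host into a bounded "device memory" on first touch, and the least-recently-used resident region is transparently swapped back to the host when the device fills up.

```python
# Illustrative region table modeling on-demand fetch plus transparent
# swapping: `host` is the backing store, `device` holds at most
# `capacity` resident regions in least-recently-used order.
from collections import OrderedDict

class RegionTable:
    def __init__(self, host_regions, device_capacity):
        self.host = host_regions          # region id -> data (backing store)
        self.capacity = device_capacity   # max regions resident on "device"
        self.device = OrderedDict()       # resident regions, LRU-first order

    def access(self, rid):
        """Return region data, fetching on demand and evicting via LRU."""
        if rid in self.device:
            self.device.move_to_end(rid)          # mark most recently used
        else:
            if len(self.device) >= self.capacity:
                victim, data = self.device.popitem(last=False)
                self.host[victim] = data          # swap victim back to host
            self.device[rid] = self.host[rid]     # on-demand fetch
        return self.device[rid]
```

Because every access goes through the table, a kernel written against region ids never sees whether a region is resident or swapped out, which is what lets the same code scale past the device memory limit.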

Bibliographic Record

  • Author: Ji, Feng
  • Affiliation: North Carolina State University
  • Degree grantor: North Carolina State University
  • Subject: Computer Science
  • Degree: Ph.D.
  • Year: 2013
  • Pages: 101 p.
  • Total pages: 101
  • Format: PDF
  • Language: eng
  • CLC classification:
  • Keywords:
  • Indexed: 2022-08-17 11:41:26
