
GPU Computations and Memory Access Model Based on Petri Nets



Abstract

In modern systems, both CPUs and GPUs are equipped with multi-level memory architectures whose levels differ in latency and capacity. Various memory access models have therefore been studied. Such a model can be seen as an interface that abstracts the user from the details of the physical architecture. In this paper we present a general and uniform GPU computation and memory access model based on bounded inhibitor Petri nets (PNs). Its effectiveness is demonstrated by comparing its throughput with that measured in practical computational experiments on an Nvidia GPU with the CUDA architecture. Our PN model is consistent with the workflow of multithreaded GPU streaming multiprocessors: it models the selection and execution of instructions for each warp. The model includes three types of instructions: arithmetic operations, shared memory accesses, and global memory accesses. For a given algorithm, the model makes it possible to check how efficient the parallelization is and whether a different organization of threads would improve performance. The accuracy of the model was tested with different kernels. For the preliminary experiments we used the matrix multiplication program and the stability example created by Nvidia; for the main experiment we used a binary version of the least significant digit radix sort algorithm. We created three CUDA implementations of the algorithm, differing in their use of shared and global memory and in the organization of calculations. The PN model was applied to each implementation, and the experimental results are presented in the paper.
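For readers unfamiliar with the formalism, the firing rule of a bounded inhibitor Petri net can be sketched in a few lines of Python. This is only an illustration of the general technique, not the authors' model: the place names (`ready_warps`, `mem_busy`, `done`), the transition names, and the capacity values are all hypothetical, and the toy net assumes that at most one global memory access may be outstanding at a time.

```python
from typing import Dict, FrozenSet, NamedTuple

Marking = Dict[str, int]  # place name -> number of tokens


class Transition(NamedTuple):
    name: str
    inputs: Dict[str, int]       # place -> tokens consumed when firing
    outputs: Dict[str, int]      # place -> tokens produced when firing
    inhibitors: FrozenSet[str]   # places that must be empty for firing


def enabled(t: Transition, marking: Marking, capacity: Dict[str, int]) -> bool:
    """Return True if t may fire under the inhibitor and capacity rules."""
    # Inhibitor arcs: any token in an inhibitor place disables the transition.
    if any(marking.get(p, 0) > 0 for p in t.inhibitors):
        return False
    # Ordinary input arcs: enough tokens must be present.
    if any(marking.get(p, 0) < w for p, w in t.inputs.items()):
        return False
    # Boundedness: no output place may exceed its capacity after firing.
    for p, w in t.outputs.items():
        after = marking.get(p, 0) - t.inputs.get(p, 0) + w
        if p in capacity and after > capacity[p]:
            return False
    return True


def fire(t: Transition, marking: Marking, capacity: Dict[str, int]) -> Marking:
    """Return the marking reached by firing t; the input marking is not modified."""
    if not enabled(t, marking, capacity):
        raise ValueError(f"transition {t.name} is not enabled")
    new = dict(marking)
    for p, w in t.inputs.items():
        new[p] = new.get(p, 0) - w
    for p, w in t.outputs.items():
        new[p] = new.get(p, 0) + w
    return new


# Hypothetical toy net: a warp issues a global memory access only when
# no other access is pending (inhibitor arc from mem_busy).
issue_global = Transition(
    "issue_global", {"ready_warps": 1}, {"mem_busy": 1}, frozenset({"mem_busy"})
)
complete_global = Transition(
    "complete_global", {"mem_busy": 1}, {"done": 1}, frozenset()
)
capacity = {"ready_warps": 2, "mem_busy": 1, "done": 2}  # assumed bounds

m0 = {"ready_warps": 2, "mem_busy": 0, "done": 0}
m1 = fire(issue_global, m0, capacity)     # first warp starts its access
blocked = not enabled(issue_global, m1, capacity)  # second warp must wait
m2 = fire(complete_global, m1, capacity)  # access finishes, warp is done
```

In this sketch the inhibitor arc is what serializes the two warps: `issue_global` becomes enabled again only after `complete_global` has emptied `mem_busy`.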
