首页> 外文会议>ICNC 2012 >Asynchronous Memory Machine Models with Barrier Synchronization
【24h】

Asynchronous Memory Machine Models with Barrier Synchronization

机译:具有屏障同步的异步内存机器型号

获取原文

摘要

The Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM) are theoretical parallel computing models that capture the essence of the shared memory and the global memory of GPUs. It was assumed that warps (i.e. groups of threads) on the DMM and the UMM work synchronously in the round-robin manner. However, warps work asynchronously in the actual GPUs, in the sense that warps may be randomly (or arbitrarily) dispatched for execution. The first contribution of this paper is to introduce an asynchronous version of the DMM and the UMM, in which warps are arbitrarily dispatched. Instead, we assume that threads can execute the "syncthreads" instruction for barrier synchronization. Since the barrier synchronization operation is costly, we should evaluate and minimize the number of barrier synchronization operations performed by parallel algorithms. The second contribution of this paper is to show a parallel algorithm to compute the sum of n numbers in optimal computing time and few barrier synchronization steps. Our parallel algorithm computes the sum of n numbers in O( n/w + l log n) time units and O(log l/w + log log w) barrier synchronization steps using wl threads both on the asynchronous DMM and on the asynchronous UMM with width w and latency l. We also prove that the computing time is optimal because it matches the theoretical lower bound. Quite surprisingly, the number of barrier synchronization steps and the number of threads are independent of n. Even if the input size n is quite large, our parallel algorithm computes the sum in optimal time units and a fixed number of syncthreads using a fixed number of threads.
机译:离散存储器(DMM)和Unified Memory Machine(UMM)是捕获共享内存的本质和GPU的全局存储器的理论并行计算模型。假设DMM上的经线(即线程)和umm以循环方式同步地工作。然而,扭曲在实际的GPU中异步工作,在派遣中可能是随机的(或任意)被调度的感知。本文的第一种贡献是引入DMM和UMM的异步版本,其中WERPS被任意分派。相反,我们假设线程可以执行“Syncthreads”指令进行屏障同步。由于屏障同步操作成本高,我们应该评估并最小化并行算法执行的屏障同步操作的数量。本文的第二贡献是显示并行算法,以在最佳计算时间和少量屏障同步步骤中计算n个数字的总和。我们的并行算法计算O(n / w + l log n)时间单位和o(log l / w + log log log w)的N个数字的总和在异步DMM和异步umm上使用WL线程使用WL线程的屏障同步步骤宽度w和延迟l。我们还证明了计算时间是最佳的,因为它与理论下限匹配。非常令人惊讶的是,屏障同步步骤的数量和线程的数量与n无关。即使输入大小n相当大,我们的并行算法也计使用固定数量的线程在最佳时间单位和固定数量的Syncthread中计算总和。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号