Journal: IEEE Transactions on Parallel and Distributed Systems

Thread-Level Locking for SIMT Architectures

Abstract

As more emerging applications move to GPUs, thread-level synchronization has become a requirement. However, GPUs provide only warp-level and thread-block-level synchronization, not thread-level synchronization. Moreover, using CPU synchronization mechanisms to implement thread-level synchronization on GPUs is highly likely to cause livelocks. In this article, we first propose a software-based thread-level synchronization mechanism for GPUs, called lock stealing, that avoids livelocks. We then describe how to implement our lock stealing algorithm in mutual exclusion locks and readers-writer locks with high performance. Finally, putting it all together, we develop a thread-level locking library (TLLL) for commercial GPUs. To evaluate TLLL and show its general applicability, we use it to implement six widely used programs. We compare TLLL against the state-of-the-art ad-hoc GPU synchronization, GPU software transactional memory (STM), and CPU hardware transactional memory (HTM). The results show that, compared with the ad-hoc GPU synchronization for Delaunay mesh refinement (DMR), TLLL improves performance by 22 percent on average on a GTX970 GPU and by up to 11 percent on a Volta V100 GPU. Moreover, it significantly reduces the required memory size. This low memory consumption enables DMR to run successfully on the GTX970 GPU with a 10-million mesh size and on the V100 GPU with a 40-million mesh size, sizes at which the ad-hoc synchronization cannot run successfully. In addition, TLLL outperforms the GPU STM by 65 percent and the CPU HTM (running on a Xeon E5-2620 v4 CPU with 16 hardware threads) by 43 percent on average.
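To make the livelock concern concrete, consider a CPU-style spin lock such as `while (atomicCAS(lock, 0, 1) != 0);` followed by the critical section. Under lockstep SIMT execution, the thread that wins the lock may be held at the loop exit while its warp siblings keep spinning, so the lock is never released. The sketch below is not the lock stealing algorithm of this article, which the abstract does not specify; it only shows a commonly used workaround in which acquire, critical section, and release sit inside one branch. All names (`locked_increment`, `run_counter_demo`) are hypothetical, and on pre-Volta GPUs this pattern works in practice but is not strictly guaranteed to make progress.

```cuda
#include <cuda_runtime.h>

// Hypothetical illustration, not the TLLL API.
// Each thread increments a shared counter exactly once under a global lock.
__global__ void locked_increment(int *lock, volatile int *counter) {
    bool done = false;
    // Retry until this thread has executed the critical section once.
    while (!done) {
        // atomicCAS returns the old value; 0 means this thread took the lock.
        if (atomicCAS(lock, 0, 1) == 0) {
            __threadfence();              // keep the critical section after the acquire
            *counter = *counter + 1;      // critical section: plain read-modify-write
            __threadfence();              // publish the update before releasing
            atomicExch(lock, 0);          // release in the same branch as the acquire
            done = true;
        }
    }
}

// Launches the kernel and returns the final counter, or -1 on a CUDA error.
int run_counter_demo(int blocks, int threads_per_block) {
    int *d_state = nullptr;               // d_state[0] is the lock, d_state[1] the counter
    if (cudaMalloc(&d_state, 2 * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMemset(d_state, 0, 2 * sizeof(int)) != cudaSuccess) {
        cudaFree(d_state);
        return -1;
    }
    locked_increment<<<blocks, threads_per_block>>>(d_state, d_state + 1);
    int counter = -1;
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(&counter, d_state + 1, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_state);
    return err == cudaSuccess ? counter : -1;
}
```

A host `main` can call `run_counter_demo(4, 64)`; if the lock serializes all 256 threads correctly, the returned count equals the number of launched threads. A single global lock like this serializes the whole grid, which is the kind of cost a dedicated thread-level locking library is meant to reduce.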
