Thread-Level Locking for SIMT Architectures

Abstract

As more emerging applications move to GPUs, thread-level synchronization has become a requirement. However, GPUs provide only warp-level and thread-block-level synchronization, not thread-level synchronization. Moreover, implementing thread-level synchronization on GPUs with CPU synchronization mechanisms is highly likely to cause live-locks. In this article, we first propose lock stealing, a software-based thread-level synchronization mechanism for GPUs that avoids live-locks. We then describe how to implement our lock stealing algorithm in mutual exclusion locks and readers-writer locks with high performance. Finally, putting it all together, we develop a thread-level locking library (TLLL) for commercial GPUs. To evaluate TLLL and show its general applicability, we use it to implement six widely used programs. We compare TLLL against the state-of-the-art ad-hoc GPU synchronization, GPU software transactional memory (STM), and CPU hardware transactional memory (HTM). The results show that, compared with the ad-hoc GPU synchronization for Delaunay mesh refinement (DMR), TLLL improves performance by 22 percent on average on a GTX970 GPU and by up to 11 percent on a Volta V100 GPU. Moreover, it significantly reduces the required memory size. This low memory consumption enables DMR to run successfully on the GTX970 GPU with a 10-million mesh size and on the V100 GPU with a 40-million mesh size, sizes at which the ad-hoc synchronization cannot run. In addition, TLLL outperforms the GPU STM by 65 percent and the CPU HTM (running on a Xeon E5-2620 v4 CPU with 16 hardware threads) by 43 percent on average.
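
The abstract does not spell out how a CPU-style mechanism live-locks on a GPU. One commonly cited scenario is sketched below as a toy, deterministic Python simulation: two lanes of the same warp run in lockstep, each needs two locks, and each uses the usual CPU pattern of try-lock, release on failure, retry. The function name run_lockstep, the two-lane setup, and the ordered flag are all hypothetical and chosen for illustration. The ordered variant is the textbook global lock ordering remedy, shown only for contrast; it is not the paper's lock stealing algorithm, which the abstract does not describe.

```python
FREE = -1  # marker for an unheld lock


def run_lockstep(rounds, ordered=False):
    """Simulate two lanes of one warp that each need locks 0 and 1.

    Every round runs three lockstep phases: all lanes try their first
    lock, all lanes that succeeded try their second lock, then every
    lane releases what it holds. Lanes that got both locks are done;
    the others retry in the next round.
    Returns the lanes that completed their critical section, in order.
    """
    locks = [FREE, FREE]
    # Lane 0 wants (0, 1). Lane 1 wants the opposite order unless a
    # global acquisition order is imposed (illustrative assumption).
    wants = {0: (0, 1), 1: (0, 1) if ordered else (1, 0)}
    done = []
    for _ in range(rounds):
        active = [lane for lane in wants if lane not in done]
        if not active:
            break
        held = {lane: [] for lane in active}
        for step in (0, 1):
            for lane in active:
                if len(held[lane]) != step:
                    continue  # failed an earlier step in this round
                target = wants[lane][step]
                if locks[target] == FREE:
                    locks[target] = lane
                    held[lane].append(target)
        for lane in active:
            if len(held[lane]) == 2:
                done.append(lane)  # critical section would run here
            for target in held[lane]:
                locks[target] = FREE  # release (success or back-off)
    return done
```

With opposite acquisition orders, both lanes take their first lock in the same step, both fail on the second, both release, and the pattern repeats identically in every round, so run_lockstep(1000) completes no lane. On a CPU, timing jitter between threads usually breaks this symmetry; lockstep execution removes that jitter, which is one reason such retry loops are risky under SIMT.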

Bibliographic details

Source
IEEE Transactions on Parallel and Distributed Systems, 2020, No. 5, pp. 1121-1136 (16 pages)
Author affiliations

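Capital Normal University, College of Information Engineering, Beijing 100048, People's Republic of China; Capital Normal University, Beijing Key Laboratory of Electronic System Reliability and Prognostics, Beijing 100048, People's Republic of China

Xi'an Jiaotong University, Xi'an, People's Republic of China

Beihang University, School of Computer Science and Engineering, Beijing 100191, People's Republic of China

Chinese Academy of Sciences, Shenzhen Institutes of Advanced Technology, Shenzhen 518055, People's Republic of China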

Original format: PDF
Language: English
Keywords
Graphics processing units; Synchronization; Message systems; System recovery; Instruction sets; Hardware; Computer architecture; Deadlocks; parallelism and concurrency; runtime environments; SIMD processors; synchronization;
