IEEE/ACM International Symposium on Microarchitecture

Complementing user-level coarse-grain parallelism with implicit speculative parallelism

Abstract

Multi-core and many-core systems are the norm in contemporary processor technology and are expected to remain so for the foreseeable future. Programs using parallel programming primitives like PThreads or OpenMP often exploit coarse-grain parallelism, because it offers a good trade-off between programming effort and performance gain. Some parallel applications show limited or no scaling beyond a certain number of cores. Given the abundant number of cores expected in future many-cores, several cores would remain idle in such cases while execution performance stagnates. This paper proposes using cores that do not contribute to performance improvement for running implicit fine-grain speculative threads. In particular, we present a many-core architecture and protocol that allow applications with coarse-grain explicit parallelism to further exploit implicit speculative parallelism within each thread. Implicit speculative parallelism frees the programmer from the additional effort of explicitly partitioning the work into finer, properly synchronized tasks. Our results show that, for a many-core comprising 128 cores that supports implicit speculative parallelism in clusters of 2 or 4 cores, performance improves on top of the highest scalability point by 41% on average for the 4-core cluster and by 27% on average for the 2-core cluster. These performance improvements come with an energy consumption that is close to, and sometimes better than, the baseline. This approach often leads to better performance and energy efficiency compared to existing alternatives such as Core Fusion and Frequency Boosting. We also investigate the tradeoffs between explicit and implicit threads as input dataset sizes vary. Finally, we present a dynamic mechanism to choose the number of explicit and implicit threads, which performs within 6% of the static oracle selection of threads.
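The core-budget trade-off described in the abstract can be illustrated with a small sketch. It is not the paper's mechanism: the `Config` type, the power-of-two thread counts, the rule that explicit threads times cluster size must fit the core count, and the `static_oracle` helper are all assumptions made for illustration. A cluster size of 1 stands for running without implicit speculative threads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical configuration: not a structure taken from the paper."""
    explicit_threads: int  # coarse-grain threads written by the programmer
    cluster_size: int      # cores per explicit thread (1 = no implicit threads)

    @property
    def cores_used(self) -> int:
        return self.explicit_threads * self.cluster_size


def feasible_configs(total_cores, cluster_sizes=(1, 2, 4)):
    """Enumerate power-of-two explicit thread counts that fit the core budget.

    Assumption: each explicit thread occupies one whole cluster.
    """
    configs = []
    for cluster_size in cluster_sizes:
        threads = 1
        while threads * cluster_size <= total_cores:
            configs.append(Config(threads, cluster_size))
            threads *= 2
    return configs


def static_oracle(runtimes):
    """Return the configuration with the lowest runtime.

    `runtimes` maps Config -> measured runtime; an oracle in this sense
    simply knows every measurement in advance.
    """
    return min(runtimes, key=runtimes.get)


if __name__ == "__main__":
    configs = feasible_configs(128)
    # With 128 cores, 4-core clusters leave room for at most 32 explicit threads.
    print(max(c.explicit_threads for c in configs if c.cluster_size == 4))
```

Under these assumptions, an application that stops scaling at 32 explicit threads could still occupy all 128 cores by giving each thread a 4-core cluster. A dynamic selector would have to approximate the choice `static_oracle` makes without having the full table of runtimes.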