
Architectural support for scalable speculative parallelization in shared-memory multiprocessors.



Abstract

Speculative parallelization aggressively executes in parallel codes that the compiler cannot fully parallelize. Past hardware proposals have mostly focused on single-chip multiprocessors (CMPs), whose effectiveness is necessarily limited by their small size. Very few schemes have attempted this technique in the context of scalable shared-memory systems.

In this thesis, we present and evaluate a new hardware scheme for scalable speculative parallelization. The design requires relatively simple hardware and integrates efficiently into a cache-coherent NUMA system. We detail the design of the node, effectively utilizing a speculative CMP as the building block for our scheme.

Simulations show that the proposed architecture delivers good speedups at a modest hardware cost. For a set of important nonanalyzable scientific loops, we report average speedups of 5.2 for 16 processors. We show that our applications require support for per-word speculative state; without it, performance suffers greatly.

With speculative parallelization, codes that cannot be fully compiler-analyzed are aggressively executed in parallel. If the hardware detects a cross-thread dependence violation at run time, it squashes the offending threads and reverts to a safe state. Squashing can cripple performance, especially in scalable multiprocessors and in systems that do not support speculative state at the fine granularity of memory words.

In this thesis, we also propose a new approach to reduce the cost of handling cross-thread data dependence violations: run-time learning. Using a new module called the Violation Prediction Table, the hardware learns to stall a thread when it seems likely to trigger a squash, and to release it when it is unlikely to trigger one. Simulations of a 16-processor scalable system show that the scheme is very effective. For a protocol that keeps speculation state on a per-line basis at the system level, learning eliminates on average 84% of the squashes. The resulting system runs on average 43% faster, and its performance is very close to that of a system with perfect prediction.
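The abstract's point about per-word versus per-line speculative state can be illustrated with a toy model. This sketch is not the thesis's actual protocol: the line size and the dependence check are simplified assumptions, chosen only to show how line-granularity tracking flags a spurious violation (and hence a squash) when two threads touch different words of the same cache line.

```python
# Toy model of speculative-state granularity. A "violation" is flagged when
# a predecessor thread's write overlaps a successor thread's speculative
# read, as seen at the chosen tracking granularity. Addresses are word
# indices; a hypothetical cache line holds 8 words.

LINE_WORDS = 8  # illustrative line size, not from the thesis

def line_of(word_addr):
    """Cache line containing a given word address."""
    return word_addr // LINE_WORDS

def violates(write_addr, read_addr, granularity):
    """True if the protocol would flag a cross-thread dependence violation."""
    if granularity == "word":
        return write_addr == read_addr          # only true dependences
    return line_of(write_addr) == line_of(read_addr)  # per-line: false sharing too

# Predecessor writes word 3; successor speculatively reads word 5.
# Both words fall in line 0, but there is no true dependence.
assert not violates(3, 5, "word")   # per-word state: no squash
assert violates(3, 5, "line")       # per-line state: spurious squash
```

Under per-line tracking, every such false-sharing pattern costs a squash and re-execution, which is one way the reported performance gap between the two granularities can arise.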
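The run-time learning idea can likewise be sketched in miniature. The indexing scheme, thresholds, and saturating counters below are illustrative assumptions, not the Violation Prediction Table design from the thesis; the sketch only shows the stall/release behavior the abstract describes: stall a thread whose access has recently caused squashes, release it once speculation on that access starts succeeding again.

```python
class ViolationPredictionTable:
    """Toy predictor (illustrative, not the thesis design): saturating
    counters indexed by a load's program counter. Squashes heat an entry
    up; successful commits cool it down. A hot entry means the next
    thread issuing that load should stall until predecessors commit."""

    def __init__(self, threshold=2, ceiling=3):
        self.counters = {}          # load_pc -> saturating counter
        self.threshold = threshold  # stall when counter reaches this
        self.ceiling = ceiling      # counter saturation value

    def should_stall(self, load_pc):
        return self.counters.get(load_pc, 0) >= self.threshold

    def record_squash(self, load_pc):
        c = self.counters.get(load_pc, 0)
        self.counters[load_pc] = min(c + 1, self.ceiling)

    def record_commit(self, load_pc):
        c = self.counters.get(load_pc, 0)
        self.counters[load_pc] = max(c - 1, 0)

vpt = ViolationPredictionTable()
assert not vpt.should_stall(0x400)   # unseen load: speculate freely
vpt.record_squash(0x400)
vpt.record_squash(0x400)
assert vpt.should_stall(0x400)       # repeated squashes: stall next time
vpt.record_commit(0x400)
assert not vpt.should_stall(0x400)   # speculation succeeding: release
```

Stalling trades some parallelism for avoided re-execution, which is consistent with the abstract's result that eliminating most squashes yields a large net speedup.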

Record details

  • Author

    Cintra, Marcelo Hehl.

  • Author affiliation

    University of Illinois at Urbana-Champaign.

  • Degree grantor: University of Illinois at Urbana-Champaign.
  • Subject: Computer Science.
  • Degree: Ph.D.
  • Year: 2001
  • Pages: 108 p.
  • Total pages: 108
  • Format: PDF
  • Language: eng
  • CLC classification: Automation technology, computer technology
  • Keywords

  • Date added: 2022-08-17 11:47:10

