Support for High-Frequency Streaming in CMPs

Abstract

As the industry moves toward larger-scale chip multiprocessors, the need to parallelize applications grows. High inter-thread communication delays, exacerbated by over-stressed high-latency memory subsystems and ever-increasing wire delays, require parallelization techniques to create partially or fully independent threads to improve performance. Unfortunately, developers and compilers alike often fail to find sufficient independent work of this kind. Recently proposed pipelined streaming techniques have shown significant promise for both manual and automatic parallelization. These techniques have wide-scale applicability because they embrace inter-thread dependences (albeit acyclic dependences) and tolerate long-latency communication of these dependences. This paper addresses the lack of architectural support for this type of concurrency, which has blocked its adoption and hindered related language and compiler research. We observe that both manual and automatic techniques create high-frequency streaming threads, with communication occurring every 5 to 20 instructions. Although such threads easily tolerate inter-thread transit delays, high-frequency communication makes thread performance very sensitive to intra-thread delays from the repeated execution of the communication operations. Using this observation, we define the design space and evaluate several mechanisms to find a better trade-off between performance and operating system, hardware, and design costs. From this, we find a light-weight streaming-aware enhancement to conventional memory subsystems that doubles the speed of these codes and is within 2% of the best-performing, but heavy-weight, hardware solution.
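
To make the pipelined streaming pattern described in the abstract concrete, the following is a minimal software sketch of two threads connected by a one-way stream. It is only an illustration and not the mechanism evaluated in the paper: the names `producer`, `consumer`, and `run_pipeline`, the sum-of-squares workload, and the queue capacity are all assumptions chosen for the example. Data flows in one direction only, so the inter-thread dependence is acyclic, and each loop iteration executes one communication operation (`put` or `get`), which is the kind of repeated per-iteration cost the abstract refers to.

```python
import queue
import threading

SENTINEL = object()  # hypothetical end-of-stream marker


def producer(items, out_q):
    """Stage 1: compute a value per input and stream it to stage 2."""
    for x in items:
        out_q.put(x * x)  # one "produce" operation per iteration
    out_q.put(SENTINEL)


def consumer(in_q, result):
    """Stage 2: consume streamed values; nothing is ever sent back to stage 1."""
    total = 0
    while True:
        v = in_q.get()  # one "consume" operation per iteration
        if v is SENTINEL:
            break
        total += v
    result.append(total)


def run_pipeline(items, capacity=32):
    """Run both stages concurrently, decoupled by a bounded queue."""
    q = queue.Queue(maxsize=capacity)
    result = []
    t1 = threading.Thread(target=producer, args=(items, q))
    t2 = threading.Thread(target=consumer, args=(q, result))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return result[0]


if __name__ == "__main__":
    print(run_pipeline(range(10)))  # prints 285
```

Because the queue is bounded, the producer can run ahead of the consumer by up to `capacity` items, which is what lets a pipeline of this shape tolerate transit delay between the stages. The cost of each `put` and `get` is paid on every iteration regardless.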