Thread criticality and TLB enhancement techniques for chip multiprocessors.

Abstract

Numerous technology trends, including debilitating power densities and rising verification costs, have recently prompted a shift to multicore or chip multiprocessor (CMP) architectures. Despite their benefits, CMPs face a number of design challenges. A key challenge is how best to architect the on-chip memory hierarchy, which plays a key role in determining system performance and power characteristics.

This thesis presents a top-down analysis, from the application level down to the microarchitectural layer, of the role of the on-chip memory hierarchy in determining the performance and power of emerging parallel workloads. Analysis shows that two primary sources of overhead in parallel program performance arise from imperfections in the on-chip memory. The first is the variation in execution speeds that multiple threads of a parallel program experience. As this thesis will show, this difference in thread criticality results in performance and energy degradation. The second source of overhead arises from the fact that emerging parallel workloads tend to stress their Translation Lookaside Buffers (TLBs) significantly. As application working sets increase, we show that modern TLBs experience notable miss rates, resulting in performance overheads.

Based on these observations, this thesis presents the first full-system characterization of the roles of thread criticality and TLB behavior in determining system performance. Using a combination of real-system profiling, full-system simulation, and FPGA-based emulation techniques, this thesis characterizes the causes of thread criticality and increasing TLB pressure. First, this work shows that cache misses are the primary cause of differing thread speeds. Specifically, threads that experience a greater number of cache misses run slower than their better-cached counterparts. Using this simple but powerful intuition, this thesis proposes thread criticality predictors with 93% accuracy. This thesis will also explore the usefulness of these criticality predictors for various resource management techniques on CMPs. Second, this work characterizes the prevalence of TLB misses, showing that while parallel workloads experience high TLB miss rates, 30% to 95% of them can be classified as predictable. This predictability arises in two ways. First, multiple cores often TLB miss on the same translation. Second, cores often TLB miss on entries whose virtual pages lie a predictable stride from one another.

This thesis then builds upon our workload characterization by proposing techniques to improve the on-chip memory hierarchy. First, I show how cache-based thread criticality prediction can improve parallel program performance by off-loading work from critical to non-critical threads. Specifically, Intel TBB's task stealing mechanism is augmented with criticality prediction to yield 21% average performance improvements. Second, this thesis shows that by estimating which threads are non-critical and by how much, critical threads may be run at a high clock rate while the others are slowed down, achieving 15% average energy savings. While this thesis focuses on these specific applications, we discuss the versatility of thread criticality prediction and how it may be applied in additional scenarios.

This thesis then uses the TLB characterization to propose TLB enhancement techniques. By leveraging the classes of predictable TLB misses, we propose and evaluate two techniques that use inter-core cooperation to eliminate TLB misses. First, I show the benefits of Inter-Core Cooperative (ICC) prefetching schemes, in which Leader-Follower prefetching exploits TLB misses experienced by multiple cores while Distance-based Cross-Core prefetching captures the presence of regular inter-core strides. Combining these approaches, ICC prefetching techniques can eliminate 19% to 90% of system TLB misses. I then propose an alternative to ICC prefetching, Shared Last-Level (SLL) TLBs, which eliminate 7% to 79% of system TLB misses.

Overall, this thesis is the first to show the importance of thread criticality and TLB enhancement techniques for parallel programs on CMPs. Moreover, as core counts, heterogeneity, and application memory footprints increase, these techniques will be essential in apportioning system resources intelligently among multiple contending threads.
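
The cache-miss intuition behind thread criticality can be illustrated with a small Python sketch. It is not the thesis's predictor: the miss weights, the `ThreadCounters` type, and the frequency bounds are all hypothetical values chosen for illustration, since the abstract gives none. The sketch ranks threads by a weighted cache-miss count, treats the highest-scoring thread as critical, and slows the others in proportion to their scores.

```python
from dataclasses import dataclass

# Hypothetical relative miss penalties; the abstract does not state any weights.
L1_MISS_WEIGHT = 1
L2_MISS_WEIGHT = 10


@dataclass
class ThreadCounters:
    """Per-thread cache-miss counts sampled over some interval (assumed input)."""
    thread_id: int
    l1_misses: int = 0
    l2_misses: int = 0

    def criticality(self) -> int:
        # More (and costlier) misses mean a slower, hence more critical, thread.
        return L1_MISS_WEIGHT * self.l1_misses + L2_MISS_WEIGHT * self.l2_misses


def most_critical(threads):
    """Return the thread predicted to be slowest."""
    return max(threads, key=lambda t: t.criticality())


def pick_frequencies(threads, f_max_mhz=2000, f_min_mhz=1000):
    """Run the critical thread at f_max_mhz and scale the rest down.

    Non-critical threads get a frequency proportional to their score
    relative to the critical thread, floored at f_min_mhz.
    """
    top = most_critical(threads).criticality()
    freqs = {}
    for t in threads:
        ratio = t.criticality() / top if top else 1.0
        freqs[t.thread_id] = max(f_min_mhz, round(f_max_mhz * ratio))
    return freqs


threads = [
    ThreadCounters(0, l1_misses=100, l2_misses=5),
    ThreadCounters(1, l1_misses=40, l2_misses=1),
    ThreadCounters(2, l1_misses=300, l2_misses=20),
    ThreadCounters(3, l1_misses=200, l2_misses=20),
]
critical = most_critical(threads)
freqs = pick_frequencies(threads)
```

The same ranking could in principle drive the task-stealing use case described above, by choosing the most critical thread as the victim to steal work from, but how the thesis integrates prediction with Intel TBB is not described in the abstract.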

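
The first class of predictable TLB misses, in which several cores miss on the same translation, can be sketched in Python as a toy simulation. This is an illustrative model only, not the Leader-Follower design from the thesis: the `Tlb` class, the LRU replacement policy, the capacity, and the choice to push translations straight into the other cores' TLBs are all simplifying assumptions. It does not model the stride-based (Distance-based Cross-Core) case.

```python
from collections import OrderedDict


class Tlb:
    """A tiny fully associative TLB with LRU replacement (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # virtual page number -> physical frame number

    def lookup(self, vpn):
        if vpn in self.entries:
            self.entries.move_to_end(vpn)  # mark as most recently used
            return True
        return False

    def fill(self, vpn, pfn):
        self.entries[vpn] = pfn
        self.entries.move_to_end(vpn)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry


def count_misses(trace, page_table, num_cores, capacity=4, leader_follower=False):
    """Count TLB misses over a trace of (core, vpn) accesses.

    With leader_follower=True, a translation fetched by the missing core
    (the leader) is also pushed to every other core (the followers).
    """
    tlbs = [Tlb(capacity) for _ in range(num_cores)]
    misses = 0
    for core, vpn in trace:
        if tlbs[core].lookup(vpn):
            continue
        misses += 1
        pfn = page_table[vpn]  # stands in for a page-table walk
        targets = tlbs if leader_follower else [tlbs[core]]
        for tlb in targets:
            tlb.fill(vpn, pfn)
    return misses


# Two cores touch the same three pages, one after the other.
page_table = {0: 100, 1: 101, 2: 102}
trace = [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2)]
baseline = count_misses(trace, page_table, num_cores=2)
cooperative = count_misses(trace, page_table, num_cores=2, leader_follower=True)
```

In this contrived trace every second miss is a repeat of a miss already taken by the other core, so sharing the translation halves the miss count. Real workloads would share far less regularly, which is consistent with the wide range of miss reductions the abstract reports.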

Bibliographic record

Author: Bhattacharjee, Abhishek
Author affiliation: Princeton University
Degree-granting institution: Princeton University
Subject: Engineering, Computer; Engineering, Electronics and Electrical; Computer Science
Degree: Ph.D.
Year: 2010
Pages: 157
Original format: PDF
Language: English

