With the proliferation of Chip Multiprocessors (CMPs), shared memory multi-threaded programs are expanding rapidly into every application domain. These programs exhibit execution characteristics that go beyond those observed in single-threaded programs, mainly due to data sharing and synchronization. To ensure that next-generation CMPs will perform well on such anticipated workloads, it is vital to understand how these programs and architectures interact, and to exploit the unique opportunities they present.

This thesis examines the time-varying execution characteristics of shared memory workloads in conjunction with the synchronization points that exist in the programs. The main hypothesis is that the type, the position, and the repetitive execution of synchronization constructs can be exploited to unfold important execution phases and enable new optimization opportunities. The research provides a simple application-driven approach for predicting program behavior and effectively driving dynamic performance optimization and resource management actions in future CMPs.

In the first part of this thesis, I show how synchronization points relate to various program-wide periodic behaviors. Based on these observations, I develop a framework in which user-level synchronization primitives are exposed to the hardware and monitored to detect program phases and guide dynamic adaptation. Through workload-driven evaluation, I demonstrate the effectiveness of the framework in improving performance and power in on-chip interconnects.

The second part of the thesis explores inter-thread communication behaviors in depth. I show that although synchronization points under the shared memory model do not expose any communication details, they are good indicators of the points where coherence communication patterns change or repeat. By leveraging this property, I design a synchronization-point-based coherence predictor that uncovers communication patterns with high accuracy while consuming significantly fewer hardware resources than existing predictors. In the last part, I investigate the underlying reasons that cause threads to wait at synchronization points, wasting resources. I show that these reasons can vary even across different program phases, and that existing critical-path predictors can become ineffective under certain conditions. I then present a new scheme that improves predictability by incorporating history information from previous synchronization points. The new design is robust and can amortize run-time imbalances to improve the system's performance and/or energy.