With the emergence of highly multithreaded architectures, an effective performance monitoring system must reflect the interaction between a large number of concurrent events and attribute the overall effect of individual events and inefficiencies to the operations in the application source code. The state-of-the-art performance counters in highly multithreaded graphics processors currently do not provide this level of precision. Although fine-grained sampling of performance counters after each source-level operation could potentially achieve the desired precision, the high sampling frequency required would likely distort the actual application behavior and make the sampled counter values inaccurate.
In this thesis, I present a novel software-based approach for monitoring memory hierarchy performance in highly multithreaded general-purpose graphics processors. The proposed analysis is based on memory traces collected for small snapshots of application execution. A trace-based memory hierarchy model, combined with a Monte Carlo experimental methodology, generates statistical bounds on performance measures in the presence of nonuniform thread interleaving and data sharing in a highly multithreaded execution environment. The statistical approach overcomes the classical problem of execution timing being disturbed by instrumentation. The approach scales well because I deploy a minimal sampling technique to reduce the trace generation overhead and the model simulation time.
The proposed scheme also keeps track of individual memory operations in the source code and can quantify how much each contributes to detrimental effects on memory system performance. A cross-validation of the model results shows close agreement with the values read from the hardware performance counters on an NVIDIA Tesla C2050.
I later use the predicted memory hierarchy performance statistics in an analytical model to identify the performance characteristics of a kernel and its expected execution time. To account for the systematic error present in the predictions, I approximate the error function and express a range of potential true execution times for each predicted value.
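The Monte Carlo idea described above can be illustrated with a minimal sketch. It is not the model developed in the thesis: the fully associative LRU cache, the uniform choice of the next thread, the percentile-based bounds, and all names and parameter values (`random_interleaving`, `lru_hit_rate`, `monte_carlo_hit_rate`, `num_lines`, `line_bytes`, `trials`) are assumptions made purely for illustration. The sketch merges per-thread address traces in many random orders, replays each merged stream through the toy cache, and reports an empirical range of hit rates.

```python
import random
from collections import OrderedDict


def random_interleaving(traces, rng):
    """Merge per-thread traces into one stream, keeping each thread's own order.

    Assumption: the next thread to issue an access is picked uniformly at random.
    """
    positions = [0] * len(traces)
    remaining = [tid for tid, trace in enumerate(traces) if trace]
    merged = []
    while remaining:
        k = rng.randrange(len(remaining))
        tid = remaining[k]
        merged.append(traces[tid][positions[tid]])
        positions[tid] += 1
        if positions[tid] == len(traces[tid]):
            remaining.pop(k)  # this thread has no accesses left
    return merged


def lru_hit_rate(addresses, num_lines, line_bytes):
    """Replay an address stream through a toy fully associative LRU cache."""
    cache = OrderedDict()  # keys are cache-line indices, ordered by recency
    hits = 0
    for addr in addresses:
        line = addr // line_bytes
        if line in cache:
            hits += 1
            cache.move_to_end(line)  # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict least recently used line
    return hits / len(addresses) if addresses else 0.0


def monte_carlo_hit_rate(traces, num_lines=4, line_bytes=128, trials=200, seed=0):
    """Return (lower, mean, upper) hit rate over random interleavings.

    The lower and upper values are empirical 2.5th and 97.5th percentiles.
    """
    rng = random.Random(seed)
    rates = sorted(
        lru_hit_rate(random_interleaving(traces, rng), num_lines, line_bytes)
        for _ in range(trials)
    )
    lower = rates[int(0.025 * (trials - 1))]
    upper = rates[int(0.975 * (trials - 1))]
    return lower, sum(rates) / trials, upper


if __name__ == "__main__":
    # Hypothetical traces: two threads, each alternating between two cache lines.
    example = [[0, 128, 0, 128], [4096, 4224, 4096, 4224]]
    print(monte_carlo_hit_rate(example, num_lines=2))
```

Because only the order of the recorded accesses is randomized, the trace itself does not need to preserve the exact timing of the instrumented run, which is the property the statistical approach relies on.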