
Locality Transformations and Prediction Techniques for Optimizing Multicore Memory Performance.



Abstract

Chip Multiprocessors (CMPs) are here to stay for the foreseeable future. In terms of programmability, what distinguishes these processors from legacy multiprocessors is that sharing among the different cores (processors) is less expensive than it was in the past. Previous research suggested that sharing is a desirable feature to incorporate into new codes. For some programs, more cache leads to more beneficial sharing, since sharing starts to kick in for large on-chip caches. This work tries to answer the question of whether we can (or should) write code differently when the underlying chip microarchitecture is a Chip Multiprocessor. We use a set of three graph benchmarks, each with three input problems varying in size and connectivity, to characterize how we partition the problem space among cores and how that partitioning can happen at multiple levels of the cache hierarchy. Good partitioning improves performance both through better utilization of the caches at the lowest level and through increased sharing of data items at the shared cache level (L2 in our case), which effectively acts as prefetching among the different compute cores.

The thesis has two thrusts. The first is exploring the design space formed by different parallelization schemes (we devise some tweaks on top of existing techniques) and different graph partitionings (a locality optimization technique suited to graph problems). The combination of parallelization strategy and graph partitioning yields a large and complex space, which we characterize using detailed simulation results to see how much gain we can obtain over a baseline legacy parallelization technique with a partition sized to fit in the L1 cache. We show that the legacy parallelization is not the best alternative in most cases and that other parallelization techniques perform better.
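The design space just described can be pictured as a simple enumeration of (parallelization strategy, partition size) pairs, with partition sizes chosen so the working set fits a target cache level. The sketch below is illustrative only: the cache sizes, the per-vertex footprint, and names such as `design_space` are our assumptions, not parameters from the thesis.

```python
# Illustrative cache capacities in bytes (assumed, not the thesis's machine).
CACHE_BYTES = {"L1": 32 * 1024, "L2": 2 * 1024 * 1024}

def design_space(strategies, bytes_per_vertex, level="L1", min_partition=64):
    """Enumerate (strategy, partition_size) pairs whose per-partition
    working set fits within the chosen cache level."""
    max_vertices = CACHE_BYTES[level] // bytes_per_vertex
    sizes = []
    size = max_vertices
    while size >= min_partition:
        sizes.append(size)
        size //= 2  # halve the partition size at each step of the sweep
    return [(s, n) for s in strategies for n in sizes]
```

Each pair in the returned list would then be evaluated (here, by detailed simulation) and ranked against the legacy baseline whose partition exactly fills L1.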
We also show that determining the partitioning size is a search problem, and that in most cases the best partitioning size is smaller than the baseline partition size.

The second thrust of the thesis is exploring how to predict the combination of parallelization and partitioning that performs best for any given benchmark under any given input data set. We use a PIN-based reuse distance profiling tool to build an execution-time prediction model that can rank-order the different combinations of parallelization strategies and partitioning sizes. We report how much of the gain the PIN prediction captures relative to what detailed simulation deems best for a given benchmark and input size. In some cases the prediction is 100% accurate; in others, the predicted combination performs worse than the baseline. We report the difference between the best-performing combination found by simulation and the PIN-predicted one, along with other statistics, to evaluate prediction quality. The PIN prediction method performs much better at predicting the partition size than at predicting the parallelization strategy. Accordingly, the accuracy of the overall scheme can be greatly improved by taking only the partition size predicted by the PIN scheme and then using a search strategy to find the best parallelization strategy for that partition size.

In this thesis, we use a detailed performance model to scan a large solution space for the best locality-optimization parameters for a set of graph problems. Using M5 performance simulation, we show gains of up to 20% over a naively picked baseline case.
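The quantity the PIN-based profiler accumulates is the reuse (LRU stack) distance: the number of distinct addresses touched between two consecutive accesses to the same address. A minimal sketch of that computation in Python, assuming a flat address trace rather than PIN's actual instrumentation API:

```python
from collections import OrderedDict

def reuse_distances(trace):
    """Compute the LRU stack (reuse) distance of each access in a trace.
    A cold (first-touch) access has infinite distance, reported as None."""
    stack = OrderedDict()  # addresses ordered least- to most-recently used
    out = []
    for addr in trace:
        if addr in stack:
            keys = list(stack)
            # distance = distinct addresses touched since the last use of addr
            out.append(len(keys) - 1 - keys.index(addr))
            del stack[addr]
        else:
            out.append(None)
        stack[addr] = None  # move addr to the most-recently-used position
    return out
```

This linear-scan version is O(n) per access; production profilers use tree-based structures to make each lookup logarithmic, but the distances produced are the same, and a histogram of them is what feeds an execution-time model.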
Our prediction scheme achieves up to 100% of the best performance gains obtainable by a search method on real hardware or in performance simulation, without running on the target hardware at all, and 48% on average across all of our benchmarks and input sizes.

There are several interesting aspects to this work. We are the first to devise and verify such a performance model against detailed simulation results. We show and quantify that locality optimization and problem partitioning can increase sharing synergistically to achieve better overall performance. We demonstrate a new use of coherent reuse distance profiles as a tool to help program developers and compilers optimize a program's performance.

Bibliographic record

  • Author: Badawy, Abdel-Hameed A.
  • Affiliation: University of Maryland, College Park.
  • Degree grantor: University of Maryland, College Park.
  • Subject: Engineering, Computer.; Computer Science.
  • Degree: Ph.D.
  • Year: 2013
  • Pages: 219 p.
  • Total pages: 219
  • Format: PDF
  • Language: eng
