Hardware profile-guided automatic page placement for ccNUMA systems


Abstract

Cache-coherent non-uniform memory architectures (ccNUMA) constitute an important class of high-performance computing platforms. Contemporary ccNUMA systems, such as the SGI Altix, have a large number of nodes, where each node consists of a small number of processors and a fixed amount of physical memory. All processors in the system access the same global virtual address space, but the physical memory is distributed across nodes, and coherence is maintained using hardware mechanisms. Accesses to local physical memory (on the same node as the requesting processor) result in lower latencies than accesses to remote memory (on a different node). Since many scientific programs are memory-bound, an intelligent page-placement policy that allocates pages closer to the requesting processor can significantly reduce the number of cycles required to access memory. We show that such a policy can lead to significant savings in wall-clock execution time.

In this paper, we introduce a novel hardware-assisted page placement scheme based on automated profiling. The placement scheme allocates each page near the processors that access it most frequently. The scheme leverages the performance monitoring capabilities of contemporary microprocessors to efficiently extract an approximate trace of memory accesses. This information is used to decide page affinity, i.e., the node to which the page is bound. Our method operates entirely in user space, is largely automated, and handles not only static but also dynamic memory allocation.

We evaluate our framework with a set of multi-threaded benchmarks from the NAS and SPEC OpenMP suites. We investigate the use of two different hardware profile sources with respect to the cost (e.g., time to trace, number of records in profile) vs. the accuracy of the profile and the corresponding savings in wall-clock execution time. We show that long-latency loads provide a better indicator for page placement than TLB misses.

Our experiments show that our method can efficiently improve page placement, leading to an average wall-clock execution time saving of more than 20% for our benchmarks, with a one-time profiling overhead of 2.7% over the overall original program wall-clock time. To the best of our knowledge, this is the first evaluation on a real machine of a completely user-mode, interrupt-driven, profile-guided page placement scheme that requires no special compiler, operating system or network interconnect support.
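
The affinity decision described in the abstract (bind each page to the node that accesses it most often) can be illustrated with a minimal sketch. This is not the authors' implementation: the sample format of `(virtual_address, node_id)` pairs, the 16 KiB page size, the function name `page_affinity`, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Assumed page size in bytes (hypothetical; the abstract does not state one).
PAGE_SIZE = 16 * 1024


def page_affinity(samples, page_size=PAGE_SIZE):
    """Map each page number to the node that accessed it most often.

    samples: iterable of (virtual_address, node_id) pairs, standing in for
    an approximate memory access trace (hypothetical format).
    """
    counts = defaultdict(Counter)
    for address, node in samples:
        counts[address // page_size][node] += 1
    # Pick the node with the highest count per page; ties go to the
    # lowest node id (an arbitrary choice to keep the result deterministic).
    return {
        page: min(per_node, key=lambda n: (-per_node[n], n))
        for page, per_node in counts.items()
    }


# Toy trace: page 0 is touched once by node 0 and twice by node 1,
# page 1 twice by node 0, page 2 once by node 2.
trace = [(0x0000, 0), (0x0010, 1), (0x0020, 1),
         (0x4000, 0), (0x4008, 0), (0x8000, 2)]
print(page_affinity(trace))  # {0: 1, 1: 0, 2: 2}
```

In a real system the resulting page-to-node map would then drive the actual placement step, which this sketch leaves out.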


Bibliographic details

Source
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2006 | 10 pages
Authors
Jaydeep Marathe; Frank Mueller
Original format: PDF
Chinese Library Classification: programming languages, algorithmic languages
Keywords
profile-guided optimization

