Venue: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Hardware profile-guided automatic page placement for ccNUMA systems

Abstract

Cache-coherent non-uniform memory architectures (ccNUMA) constitute an important class of high-performance computing platforms. Contemporary ccNUMA systems, such as the SGI Altix, have a large number of nodes, where each node consists of a small number of processors and a fixed amount of physical memory. All processors in the system access the same global virtual address space, but the physical memory is distributed across nodes, and coherence is maintained using hardware mechanisms. Accesses to local physical memory (on the same node as the requesting processor) result in lower latencies than accesses to remote memory (on a different node). Since many scientific programs are memory-bound, an intelligent page-placement policy that allocates pages closer to the requesting processor can significantly reduce the number of cycles required to access memory. We show that such a policy can lead to significant savings in wall-clock execution time.

In this paper, we introduce a novel hardware-assisted page placement scheme based on automated profiling. The placement scheme allocates each page near the processors that access it most frequently. The scheme leverages the performance monitoring capabilities of contemporary microprocessors to efficiently extract an approximate trace of memory accesses. This information is used to decide page affinity, i.e., the node to which the page is bound. Our method operates entirely in user space, is largely automated, and handles not only static but also dynamic memory allocation.

We evaluate our framework with a set of multi-threaded benchmarks from the NAS and SPEC OpenMP suites. We investigate the use of two different hardware profile sources, comparing the cost of the profile (e.g., time to trace, number of records in the profile) against its accuracy and the corresponding savings in wall-clock execution time. We show that long-latency loads provide a better indicator for page placement than TLB misses.

Our experiments show that our method can efficiently improve page placement, leading to an average wall-clock execution time saving of more than 20% for our benchmarks, with a one-time profiling overhead of 2.7% of the original program's overall wall-clock time. To the best of our knowledge, this is the first evaluation on a real machine of a completely user-mode, interrupt-driven, profile-guided page placement scheme that requires no special compiler, operating system, or network interconnect support.
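The page-affinity decision described in the abstract (bind each page to the node whose processors access it most often) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `page_affinity`, the `(address, node)` sample format, the 16 KiB page size, and the tie-breaking rule are all assumptions made for illustration, and the sketch only computes the affinity map without performing any actual page binding.

```python
from collections import Counter, defaultdict

# Assumed page size in bytes (hypothetical; the real value depends on the system).
PAGE_SIZE = 16 * 1024


def page_affinity(samples, page_size=PAGE_SIZE):
    """Map each sampled page to the node that accessed it most often.

    samples: iterable of (virtual_address, node_id) pairs, standing in for
    an approximate trace of memory accesses from hardware monitoring.
    """
    counts = defaultdict(Counter)
    for address, node in samples:
        # Group accesses by page number, then count them per node.
        counts[address // page_size][node] += 1
    # Pick the most frequent node per page; ties go to the lowest node id
    # (an arbitrary choice made for this sketch).
    return {
        page: min(per_node, key=lambda n: (-per_node[n], n))
        for page, per_node in counts.items()
    }


if __name__ == "__main__":
    # Hypothetical trace: page 0 is touched mostly by node 1, page 1 by node 0.
    trace = [(0x0000, 0), (0x0010, 1), (0x0020, 1), (0x4000, 0), (0x4008, 0)]
    for page, node in sorted(page_affinity(trace).items()):
        print(f"page {page} -> node {node}")
```

In a real system the resulting map would then drive a page-binding step; how that binding is carried out is not specified in the abstract and is left out here.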
