Journal of Parallel and Distributed Computing

Feedback-directed page placement for ccNUMA via hardware-generated memory traces

Abstract

Non-uniform memory architectures with cache coherence (ccNUMA) are becoming increasingly common, not just for large-scale high-performance platforms but also in the context of multi-core architectures. Under ccNUMA, data placement may influence overall application performance significantly, because references resolved locally to a processor/core incur lower latencies than remote ones. This work develops a novel hardware-assisted page placement paradigm based on automated tracing of the memory references made by application threads. Two placement schemes, modeling single-level and multi-level latencies respectively, allocate each page near the processors that access it most frequently. These schemes leverage the performance monitoring capabilities of contemporary microprocessors to efficiently extract an approximate trace of memory accesses. This information is used to decide page affinity, i.e., the node to which the page is bound. The method operates entirely in user space, is largely automated, and handles not only static but also dynamic memory allocation. Experiments show that this method, although based on lossy tracing, can efficiently and effectively improve page placement, leading to an average wall-clock execution time saving of over 20% for the tested benchmarks on the SGI Altix with a 2x remote access penalty and 12% on AMD Opterons with a 1.3-2.0x access penalty. This is accompanied by a one-time tracing overhead of 2.7% over the overall original program wall-clock time.
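
The affinity decision described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the trace format (pairs of accessing node and address), the 4 KiB page size, the integer latency matrix, the tie-breaking rule, and all function names are assumptions made for illustration. The single-level variant binds a page to the node with the most sampled accesses; the multi-level variant is one plausible reading, picking the home node that minimizes the latency-weighted access cost. Actually binding pages to nodes is left out.

```python
from collections import Counter, defaultdict

PAGE_SIZE = 4096  # assumed page size in bytes


def count_accesses(trace, page_size=PAGE_SIZE):
    """Group sampled (node, address) pairs into per-page access counts."""
    counts = defaultdict(Counter)
    for node, addr in trace:
        counts[addr // page_size][node] += 1
    return counts


def place_single_level(counts):
    """Bind each page to the node with the most sampled accesses.

    Ties go to the lowest node id (an arbitrary, assumed rule).
    """
    return {
        page: min(c, key=lambda n: (-c[n], n))
        for page, c in counts.items()
    }


def place_multi_level(counts, latency):
    """Bind each page to the home node with the lowest weighted cost.

    latency[home][accessor] is an assumed relative access cost
    (integers here, so that cost comparisons are exact).
    """
    nodes = range(len(latency))
    placement = {}
    for page, c in counts.items():
        def cost(home):
            return sum(cnt * latency[home][n] for n, cnt in c.items())
        placement[page] = min(nodes, key=lambda home: (cost(home), home))
    return placement


# Hypothetical lossy trace: (accessing node, sampled address).
trace = [(0, 0x1000), (1, 0x1008), (1, 0x1010),
         (2, 0x2000), (0, 0x2004), (3, 0x2008)]

# Hypothetical 4-node machine: nodes {0,1} and {2,3} are close pairs.
latency = [[10, 13, 20, 20],
           [13, 10, 20, 20],
           [20, 20, 10, 13],
           [20, 20, 13, 10]]

counts = count_accesses(trace)
single = place_single_level(counts)
multi = place_multi_level(counts, latency)
```

In this toy trace the second page is touched once each by nodes 0, 2, and 3. The single-level rule has no majority to follow and falls back to its tie-break, while the multi-level rule prefers node 2 because two of the three accessors are close to it.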
