Conference: International Symposium on High-Performance Computer Architecture
Improving the data cache performance of multiprocessor operating systems

Abstract

Bus-based shared-memory multiprocessors with coherent caches have recently become very popular. To achieve high performance, these systems rely on increasingly sophisticated cache hierarchies. However, while these machines often run loads with substantial operating system activity, performance measurements have consistently indicated that the operating system uses the data cache hierarchy poorly. In this paper, we address how to eliminate most of the data cache misses in a multiprocessor operating system while still using off-the-shelf processors. We use a performance monitor to examine traces of a 4-processor machine running four system-intensive loads under UNIX. Based on our observations, we propose hardware and software support that targets block operations, coherence activity, and cache conflicts. For block operations, simple cache bypassing or prefetching schemes are undesirable. Instead, it is best to use a DMA-like scheme that pipelines the data transfer on the bus without involving the processor. Coherence misses are handled with data privatization and relocation, and with the use of updates for a small core of shared variables. Finally, the remaining miss hot spots are handled with data prefetching. Overall, our simulations show that all these optimizations combined eliminate or hide 75% of the operating system data misses in 32-Kbyte primary caches. Furthermore, they speed up the operating system by 19%.
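As a rough illustration of what data privatization and relocation can look like in software, the C sketch below gives each processor its own copy of a counter and places each copy in a separate cache line. This is not the paper's code: the names `private_stats`, `count_syscall`, and `total_syscalls`, the counter itself, and the 64-byte line size are all assumptions made for the example. Only the processor count of 4 comes from the abstract.

```c
#include <assert.h>
#include <stdalign.h>

#define NCPU 4         /* the abstract describes a 4-processor machine */
#define LINE_SIZE 64   /* assumed cache line size in bytes (hypothetical) */

/* Packed layout: the per-processor counters sit side by side, so several
 * of them fall in one cache line and a write by one processor invalidates
 * the line in the other processors' caches. */
struct shared_stats {
    unsigned long syscalls[NCPU];
};

/* Privatized and relocated layout: one counter per processor, each aligned
 * to its own cache line so that updates do not cause coherence traffic. */
struct private_stats {
    alignas(LINE_SIZE) unsigned long syscalls;
};

static struct private_stats per_cpu[NCPU];

/* Each processor updates only its own private copy. */
void count_syscall(int cpu)
{
    assert(cpu >= 0 && cpu < NCPU);
    per_cpu[cpu].syscalls++;
}

/* The rare reader pays the cost of combining the private copies. */
unsigned long total_syscalls(void)
{
    unsigned long sum = 0;
    for (int i = 0; i < NCPU; i++)
        sum += per_cpu[i].syscalls;
    return sum;
}
```

A caller would invoke `count_syscall(cpu)` on the frequent update path and `total_syscalls()` only when the combined value is needed. The trade-off in this sketch is extra memory (one line per processor) in exchange for fewer coherence misses on the update path.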