IEEE Transactions on Parallel and Distributed Systems

Performance Modeling of Parallel Loops on Multi-Socket Platforms Using Queueing Systems


Abstract

Predicting how the performance of parallel loops on modern shared-memory multi-socket multi-core systems depends on the allocated resources is an important means of achieving better system utilization. Previous prediction techniques are tied to specific architectures and cannot make purely online performance predictions; they require an offline analysis of the parallel program. This paper presents a practical approach based on queueing theory that models the performance of parallel programs as a function of the number of allocated cores. Based on the key insight that the scalability of scientific parallel loops is limited by memory performance, a hierarchically constructed M/M/1/N/N queueing system is used to analytically compute the response time at the different congestion points in the memory system of modern NUMA architectures. After the model is automatically tuned to a specific architecture by executing a number of micro-benchmarks, the required parameter values are obtained at runtime from hardware performance counters present in modern commodity AMD and Intel processors. Evaluated with 24 OpenMP parallel loops on a 64-core AMD and a 72-core Intel multi-socket platform, the presented queueing system predicts the speedup of parallel loops with a mean absolute percentage error of 8.3 percent on the AMD system and 6.7 percent on the Intel platform.
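
To make the abstract's two central quantities concrete, the following is a minimal sketch of a single M/M/1/N/N (finite-source) queue and of the mean absolute percentage error. It is not the paper's hierarchical model. The function names, the parameters `think_rate` and `service_rate`, and the choice to estimate speedup as the ratio of request throughput with n cores to throughput with one core are assumptions made here for illustration only.

```python
def mm1nn(n_cores, think_rate, service_rate):
    """Throughput and mean response time of one M/M/1/N/N queue.

    Assumed setup: each of n_cores cores issues a memory request at rate
    think_rate while computing, and a single server (for example a memory
    controller) handles requests at rate service_rate.
    """
    rho = think_rate / service_rate
    # Unnormalised state weights w_k = N!/(N-k)! * rho^k, built iteratively.
    weights = [1.0]
    for k in range(1, n_cores + 1):
        weights.append(weights[-1] * (n_cores - k + 1) * rho)
    p0 = 1.0 / sum(weights)                  # probability the server is idle
    throughput = service_rate * (1.0 - p0)   # completed requests per time unit
    # Little's law for a closed system: N = X * (R + 1/think_rate).
    response_time = n_cores / throughput - 1.0 / think_rate
    return throughput, response_time


def predicted_speedup(n_cores, think_rate, service_rate):
    """Illustrative assumption: speedup equals the throughput ratio X(n)/X(1)."""
    x_n, _ = mm1nn(n_cores, think_rate, service_rate)
    x_1, _ = mm1nn(1, think_rate, service_rate)
    return x_n / x_1


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    errors = [abs(p - a) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


if __name__ == "__main__":
    # Hypothetical rates, chosen only to show saturation as cores are added.
    for n in (1, 8, 64):
        print(n, round(predicted_speedup(n, think_rate=1.0, service_rate=10.0), 2))
```

With a single core the response time reduces to the bare service time `1 / service_rate`. As cores are added the server saturates and the predicted speedup flattens, which is the memory-bound scaling behaviour the abstract describes.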
