Journal: ACM Transactions on Modeling and Performance Evaluation of Computing Systems

Production Application Performance Data Streaming for System Monitoring

Abstract

In this article, we present an approach to streaming collection of application performance data. Practical application performance tuning and troubleshooting in production high-performance computing (HPC) environments requires an understanding of how applications interact with the platform, including (but not limited to) parallel programming libraries such as Message Passing Interface (MPI). Several profiling and tracing tools exist that collect heavy runtime data traces either in memory (released only at application exit) or on a file system (imposing an I/O load that may interfere with the performance being measured). Although these approaches are beneficial in development stages and post-run analysis, a systemwide and low-overhead method is required to monitor deployed applications continuously. This method must be able to collect information at both the application and system levels to yield a complete performance picture.

In our approach, an application profiler collects application event counters. A sampler uses an efficient inter-process communication method to periodically extract the application counters and stream them into an infrastructure for performance data collection. We implement a tool-set based on our approach and integrate it with the Lightweight Distributed Metric Service (LDMS) system, a monitoring system used on large-scale computational platforms. LDMS provides the infrastructure to create and gather streams of performance data in a low-overhead manner. We demonstrate our approach using applications implemented with MPI, as it is one of the most common standards for the development of large-scale scientific applications.

We utilize our tool-set to study the impact of our approach on an open source HPC application, Nalu. Our tool-set enables us to efficiently identify patterns in the behavior of the application without source-level knowledge. We leverage LDMS to collect system-level performance data and explore the correlation between the system and application events. Also, we demonstrate how our tool-set can help detect anomalies with low latency. We run tests on two different architectures: a system enabled with Intel Xeon Phi and another system equipped with an Intel Xeon processor. Our overhead study shows our method imposes at most 0.5% CPU usage overhead on the application in realistic deployment scenarios.
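The profiler/sampler split described in the abstract can be illustrated with a minimal sketch. It is not the authors' tool-set and does not use any LDMS API: the abstract does not name the inter-process communication method, so an anonymous memory map stands in for it here as an assumption, and the event names, `CounterRegion`, and `Sampler` are all hypothetical.

```python
import mmap
import struct

# Hypothetical counter layout: one unsigned 64-bit slot per event name.
EVENTS = ("MPI_Send", "MPI_Recv", "MPI_Allreduce")
SLOT = struct.Struct("<Q")


class CounterRegion:
    """Fixed-size block of event counters in an anonymous memory map.

    Assumption: a memory-mapped region stands in for whatever IPC
    mechanism the real tool-set uses.
    """

    def __init__(self):
        # Anonymous maps are zero-filled, so every counter starts at 0.
        self.buf = mmap.mmap(-1, SLOT.size * len(EVENTS))

    def increment(self, event, amount=1):
        # Profiler side: bump the counter for one intercepted call.
        offset = EVENTS.index(event) * SLOT.size
        (value,) = SLOT.unpack_from(self.buf, offset)
        SLOT.pack_into(self.buf, offset, value + amount)

    def snapshot(self):
        # Sampler side: read all counters in one pass.
        return {
            name: SLOT.unpack_from(self.buf, i * SLOT.size)[0]
            for i, name in enumerate(EVENTS)
        }


class Sampler:
    """Turns cumulative counters into per-interval deltas for streaming."""

    def __init__(self, region):
        self.region = region
        self.previous = region.snapshot()

    def sample(self):
        # Report only what changed since the previous sample.
        current = self.region.snapshot()
        delta = {name: current[name] - self.previous[name] for name in EVENTS}
        self.previous = current
        return delta
```

In a deployment along the lines the abstract describes, `sample()` would presumably run on a timer in a separate process and its result would be forwarded to the monitoring infrastructure; here both sides run in one process only to keep the sketch self-contained.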