Conference: IEEE International Conference on Cluster Computing

Democratizing Parallel Filesystem Monitoring

Abstract

Parallel filesystems (PFSs) are among the most critical high-availability components of High Performance Computing (HPC) systems. Most HPC workloads depend on the availability of a POSIX-compliant parallel filesystem that provides a globally consistent view of data to all compute nodes of an HPC system. Because of this central role, failure or performance degradation events in the PFS can impact every user of an HPC resource. Users, and even many HPC staff, typically lack the information needed to identify the causes of these PFS events, which impedes timely and targeted remedies. The relevant information is distributed across PFS servers; however, access to these servers is highly restricted because of the sensitive role they play in the operation of an HPC system. The information is also challenging to aggregate and interpret, which leaves the diagnosis and treatment of PFS issues to a select few experts with privileged system access. To democratize this information, we are developing an open-source, user-facing Parallel FileSystem TRacing and Analysis SErvice (PFSTRASE) that analyzes the requisite data to establish causal relationships between PFS activity and events detrimental to stability and performance. We are implementing the service for the open-source Lustre filesystem, which is the most commonly used PFS at large-scale HPC sites. The service will measure and aggregate server loads for specific PFS I/O operations (IOPs) to automatically estimate the effective load generated by every client, job, and user. The infrastructure provides a real-time, user-accessible text-based interface and a publicly accessible web interface that displays both real-time and historical data.
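The abstract does not say how the effective load is computed. As a rough illustration only, the Python sketch below assumes each server reports operation counts tagged with client, job, and user, and that each operation type carries a relative cost. The record layout, the `OP_WEIGHTS` values, and the `effective_load` function are hypothetical and are not taken from PFSTRASE.

```python
from collections import defaultdict

# Hypothetical relative cost of each I/O operation type on a server.
# These weights are illustrative assumptions, not values from the paper.
OP_WEIGHTS = {"open": 1.0, "close": 0.5, "read": 2.0, "write": 3.0}


def effective_load(samples, key):
    """Sum weighted operation counts per client, job, or user.

    samples: iterable of dicts with keys "client", "job", "user",
             "op", and "count" (a hypothetical record layout).
    key:     field to aggregate by: "client", "job", or "user".
    """
    totals = defaultdict(float)
    for s in samples:
        # Unknown operation types fall back to a weight of 1.0.
        totals[s[key]] += OP_WEIGHTS.get(s["op"], 1.0) * s["count"]
    return dict(totals)


# Made-up sample records, as if collected from several servers.
samples = [
    {"client": "c1", "job": "j1", "user": "alice", "op": "read", "count": 10},
    {"client": "c2", "job": "j1", "user": "alice", "op": "write", "count": 4},
    {"client": "c3", "job": "j2", "user": "bob", "op": "open", "count": 6},
]

by_client = effective_load(samples, "client")
by_job = effective_load(samples, "job")
by_user = effective_load(samples, "user")
```

Aggregating the same records under different keys is what would let one view attribute load to a single client while another attributes it to a whole job or user.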