Journal: IEEE Transactions on Parallel and Distributed Systems

Improving the Reliability of MPI Libraries via Message Flow Checking



Abstract

Despite the success of the Message Passing Interface (MPI), many MPI libraries have suffered from software bugs. These bugs severely impact the productivity of a large number of users, causing program failures or other errors. As a result, MPI application developers often have to spend days or weeks in vain debugging their own code. To address this daunting problem, this paper presents a new method called FlowChecker, which detects communication-related bugs in MPI libraries. First, FlowChecker extracts program intentions of message passing (MP-intentions), which specify messages to be delivered from the sources to the destinations. Then FlowChecker tracks the message flows that actually occur in the underlying MPI libraries. Finally, FlowChecker checks whether the messages are correctly delivered from the sources to the destinations by comparing the message flows against the MP-intentions. If a mismatch is found, FlowChecker reports a bug and provides diagnostic information to help MPI library developers understand and fix it. We have built a FlowChecker prototype on Linux and evaluated it with five real-world and two injected bug cases in three widely used MPI libraries: Open MPI, MPICH2, and MVAPICH2. Our experimental results show that FlowChecker effectively detects all seven evaluated bug cases. Additionally, it provides useful diagnostic information for narrowing down or even pinpointing root causes of the bugs. Moreover, our experiments with High Performance Linpack and NAS Parallel Benchmarks show that FlowChecker induces low runtime overhead (0.9-5.6 percent on Open MPI, 0.9-8.1 percent on MPICH2, and 1.6-9.7 percent on MVAPICH2).
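The comparison step described in the abstract (message flows checked against MP-intentions) can be illustrated with a toy sketch. This is not FlowChecker's implementation, which the abstract does not detail: the `Message` record, its fields, and the `check_flows` function are hypothetical names chosen for illustration, and the sketch assumes both intentions and observed flows are already available as plain lists.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Hypothetical record of one message; not FlowChecker's actual format."""
    src: int        # rank that sends the message
    dst: int        # rank that should receive it
    tag: int        # MPI-style message tag
    payload: bytes  # data carried by the message


def check_flows(intentions, observed):
    """Compare intended messages with observed deliveries (illustrative only).

    Returns two lists: messages that were intended but never observed,
    and messages that were observed but never intended.
    """
    want = Counter(intentions)
    got = Counter(observed)
    missing = list((want - got).elements())
    unexpected = list((got - want).elements())
    return missing, unexpected


# What the application asked for (assumed to be extracted elsewhere).
intentions = [Message(0, 1, 7, b"abc"), Message(1, 0, 7, b"xyz")]
# What was actually delivered; the second payload is corrupted on purpose.
observed = [Message(0, 1, 7, b"abc"), Message(1, 0, 7, b"xyZ")]

missing, unexpected = check_flows(intentions, observed)
for m in missing:
    print("intended but not delivered:", m)
for m in unexpected:
    print("delivered but not intended:", m)
```

Using multisets rather than sets means a message that is intended twice but delivered once still shows up as a mismatch. A real checker would also need to handle ordering, wildcards, and collective operations, none of which this sketch attempts.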
