Journal: Future Generation Computer Systems

Error detection in large-scale parallel programs with long runtimes

Abstract

Error detection is an important activity of program development, which is applied to detect incorrect computations or runtime failures of software. The costs of debugging are strongly related to the complexity and the scale of the investigated programs. Both characteristics are especially cumbersome for large-scale parallel programs with long runtimes, which are quite common in computational science and engineering (CSE) applications. A solution is offered by a combination of techniques using the event graph model as a representation of parallel program behaviour. With process isolation, a subset of the original number of processes can be investigated, while the absent processes are simulated by the debugging system. With checkpointing, an arbitrary temporal section of a program's runtime can be extracted for exhaustive analysis without the need to restart the program from the beginning. Additional benefits of the event graph are support of equivalent execution of nondeterministic programs, as well as a comprehensible visualisation as a space-time diagram.
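
The abstract does not give implementation details, so the following is only a minimal sketch of how an event graph and process isolation might be represented. All names here (`Event`, `EventGraph`, `add_message`, `isolate`) are hypothetical and are not taken from the paper. The sketch assumes that events are ordered per process, that a message is an edge from a send event to a receive event, and that isolating a subset of processes means identifying which incoming messages the debugger would have to supply on behalf of the absent processes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    process: int  # rank of the process on which the event occurred
    index: int    # position in that process's local event sequence
    kind: str     # "send" or "recv" in this sketch


@dataclass
class EventGraph:
    # Hypothetical structure: a flat event list plus send -> recv edges.
    events: list = field(default_factory=list)
    message_edges: list = field(default_factory=list)

    def add_event(self, process, kind):
        # The local index is the number of events already recorded on this process.
        index = sum(1 for e in self.events if e.process == process)
        event = Event(process, index, kind)
        self.events.append(event)
        return event

    def add_message(self, sender, receiver):
        # Record one message as a send event, a receive event, and an edge between them.
        send = self.add_event(sender, "send")
        recv = self.add_event(receiver, "recv")
        self.message_edges.append((send, recv))
        return send, recv

    def isolate(self, kept):
        # Split message edges for the processes under investigation.
        # internal:  both endpoints are kept, so the message is exchanged for real.
        # simulated: the sender is absent, so the debugger must supply the message.
        kept = set(kept)
        internal = [(s, r) for s, r in self.message_edges
                    if s.process in kept and r.process in kept]
        simulated = [(s, r) for s, r in self.message_edges
                     if s.process not in kept and r.process in kept]
        return internal, simulated


if __name__ == "__main__":
    graph = EventGraph()
    graph.add_message(0, 1)
    graph.add_message(2, 1)
    graph.add_message(1, 0)
    internal, simulated = graph.isolate({0, 1})
    print("exchanged between kept processes:", internal)
    print("to be simulated by the debugger:", simulated)
```

In this sketch, investigating only processes 0 and 1 leaves the message from process 2 as the single edge that would need to be replayed from a recorded trace. A space-time diagram could be drawn from the same data by plotting each event at (index, process) and each message edge as a line between its two events.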