Coordinated Fault Tolerance for High Performance, Final Project Report.

Abstract

The main purpose of the Center for the Improvement of Fault Tolerance in Systems has been to conduct research aimed at providing end-to-end, system-wide fault tolerance for applications and other system software. While fault tolerance has been an integral part of most high-end computing (HEC) system software developed over the past decade, it has been treated mostly as a collection of isolated stovepipes. Visibility of and response to faults have typically been limited to the particular hardware and software subsystems in which the faults are first observed. Little fault information is shared across subsystems, which allows little flexibility or control on a system-wide basis and makes it practically impossible to provide cohesive end-to-end fault tolerance in support of scientific applications. For example, a communication link failure can be seen by a middleware or network library but is not directly visible to the job scheduler, and a node failure can be detected by system monitoring software but is not inherently visible to the resource manager. If the middleware, network libraries, or monitoring software could share information about such faults, then other system software, such as a resource manager or job scheduler, could exclude failed nodes and failed network links from further job allocations and trigger further diagnosis.
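
The sharing scenario in the abstract can be illustrated with a minimal publish/subscribe sketch. It is not the report's design or API: `FaultEvent`, `FaultBus`, `Scheduler`, and the event kind `"node_failure"` are all hypothetical names chosen for illustration. It assumes a single in-process channel on which a monitoring component publishes faults and a scheduler subscribes to them.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, Iterable, List, Set


@dataclass(frozen=True)
class FaultEvent:
    """Hypothetical fault record shared across subsystems."""
    source: str    # subsystem that observed the fault, e.g. "monitor"
    kind: str      # illustrative category, e.g. "node_failure"
    resource: str  # identifier of the failed node or link


class FaultBus:
    """Hypothetical in-process publish/subscribe channel for fault events."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[FaultEvent], None]]] = defaultdict(list)

    def subscribe(self, kind: str, handler: Callable[[FaultEvent], None]) -> None:
        self._subscribers[kind].append(handler)

    def publish(self, event: FaultEvent) -> None:
        # Deliver the event to every handler registered for its kind.
        for handler in self._subscribers[event.kind]:
            handler(event)


class Scheduler:
    """Toy job scheduler that excludes nodes reported as failed."""

    def __init__(self, nodes: Iterable[str], bus: FaultBus) -> None:
        self._nodes: List[str] = list(nodes)
        self._excluded: Set[str] = set()
        bus.subscribe("node_failure", self._on_node_failure)

    def _on_node_failure(self, event: FaultEvent) -> None:
        self._excluded.add(event.resource)

    def allocate(self, count: int) -> List[str]:
        # Only nodes with no reported failure are eligible for allocation.
        healthy = [n for n in self._nodes if n not in self._excluded]
        if count > len(healthy):
            raise RuntimeError("not enough healthy nodes")
        return healthy[:count]


bus = FaultBus()
scheduler = Scheduler(["n0", "n1", "n2"], bus)
# The monitoring software reports a failure the scheduler could not see itself.
bus.publish(FaultEvent(source="monitor", kind="node_failure", resource="n1"))
allocation = scheduler.allocate(2)
```

After the monitoring component publishes the failure of `n1`, the scheduler allocates only from `n0` and `n2`, which is the kind of cross-subsystem response the abstract argues isolated subsystems cannot provide.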

Bibliographic record

Authors: Beckman, P.; Stevens, R. L.
Year: 2011
Pages: 1-35
Total pages: 35
Original format: PDF
Language: English
Chinese Library Classification: Industrial technology
Keywords: Fault tolerant computers; Communication links; Computer hardware; Computer networks; Computer software; Monitoring; High performance computing


