FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks

Zhao Kai; Di Sheng; Li Sihuan; Liang Xin; Zhai Yujia; Chen Jieyang; Ouyang Kaiming; Cappello Franck; Chen Zizhong


Abstract

Convolutional neural networks (CNNs) are becoming more and more important for solving challenging and critical problems in many fields. CNN inference applications have been deployed in safety-critical systems, which may suffer from soft errors caused by high-energy particles, high temperature, or abnormal voltage. Of critical importance is ensuring the stability of the CNN inference process against soft errors. Traditional fault tolerance methods are not suitable for CNN inference because error-correcting code is unable to protect computational components, instruction duplication techniques incur high overhead, and existing algorithm-based fault tolerance (ABFT) techniques cannot protect all convolution implementations. In this article, we focus on how to protect the CNN inference process against soft errors as efficiently as possible, with the following three contributions. (1) We propose several systematic ABFT schemes based on checksum techniques and analyze their fault protection ability and runtime thoroughly. Unlike traditional ABFT based on matrix-matrix multiplication, our schemes support any convolution implementation. (2) We design a novel workflow integrating all the proposed schemes to obtain a high detection/correction ability with limited total runtime overhead. (3) We perform our evaluation using ImageNet with well-known CNN models including AlexNet, VGG-19, ResNet-18, and YOLOv2. Experimental results demonstrate that our implementation can handle soft errors with very limited runtime overhead (4%~8% in both error-free and error-injected situations).
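
The abstract does not spell out the checksum schemes, so the following is only a minimal sketch of the general idea that checksum-based ABFT rests on: convolution is linear, so the sum of the outputs produced by several kernels should equal the output produced by the sum of those kernels. The helper names (`conv2d`, `add_maps`, `checksum_ok`), the tiny integer inputs, and the tolerance are assumptions made for illustration; this is not the FT-CNN implementation or any of its specific schemes.

```python
# Illustrative sketch only: a generic checksum test based on the linearity
# of convolution. All names here are hypothetical, not taken from FT-CNN.

def conv2d(image, kernel):
    """Valid (no padding, stride 1) 2D cross-correlation on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(ow)]
            for i in range(oh)]

def add_maps(maps):
    """Element-wise sum of equally sized 2D maps."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[i][j] for m in maps) for j in range(cols)]
            for i in range(rows)]

def checksum_ok(image, kernels, outputs, tol=1e-9):
    """Return True if sum(outputs) matches conv2d(image, sum(kernels))."""
    expected = conv2d(image, add_maps(kernels))  # checksum, computed separately
    observed = add_maps(outputs)                 # sum of the outputs under test
    return all(abs(e - o) <= tol
               for e_row, o_row in zip(expected, observed)
               for e, o in zip(e_row, o_row))

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernels = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
outputs = [conv2d(image, k) for k in kernels]

clean = checksum_ok(image, kernels, outputs)      # no fault injected
outputs[1][0][0] += 1                             # simulate a soft error
corrupted = checksum_ok(image, kernels, outputs)  # the mismatch is detected
```

Because the check only compares outputs against an independently computed checksum, it does not depend on how the convolution itself is implemented. With floating-point data the tolerance `tol` would need to be chosen to absorb rounding differences, which this sketch does not address.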


Bibliographic record

Source
IEEE Transactions on Parallel and Distributed Systems, 2021, Issue 7, pp. 1677-1689 (13 pages)
Authors
Zhao Kai; Di Sheng; Li Sihuan; Liang Xin; Zhai Yujia; Chen Jieyang; Ouyang Kaiming; Cappello Franck; Chen Zizhong;
Author affiliations

Univ Calif Riverside Dept Comp Sci & Engn Riverside CA 92521 USA;

Argonne Natl Lab Math & Comp Sci Div Lemont IL 60439 USA;

Univ Calif Riverside Dept Comp Sci & Engn Riverside CA 92521 USA;

Oak Ridge Natl Lab Comp Sci & Math Div Oak Ridge TN 37831 USA;

Univ Calif Riverside Dept Comp Sci & Engn Riverside CA 92521 USA;

Oak Ridge Natl Lab Comp Sci & Math Div Oak Ridge TN 37831 USA;

Univ Calif Riverside Dept Comp Sci & Engn Riverside CA 92521 USA;

Argonne Natl Lab Math & Comp Sci Div Lemont IL 60439 USA;

Univ Calif Riverside Dept Comp Sci & Engn Riverside CA 92521 USA;

Original format: PDF
Language: English
Keywords
Convolution; Runtime; Kernel; Fault tolerant systems; Fault tolerance; Error correction codes; Mathematical model; Algorithm-based fault tolerance; deep learning; silent data corruption; reliability; high-performance computing;


