International Conference on Algorithms and Architectures for Parallel Processing

Comparative Study of Distributed Deep Learning Tools on Supercomputers



Abstract

With the growth in the scale of data sets and neural networks, training time is increasing rapidly. Distributed parallel training has been proposed to accelerate deep neural network training, and most efforts target GPU clusters. This paper focuses on the performance of distributed parallel training on the CPU clusters of supercomputer systems. Using resources of the "Tianhe-2" supercomputer system, we conduct an extensive evaluation of the performance of popular deep learning tools, including Caffe, TensorFlow, and BigDL, and test several deep neural network models, including Auto-Encoder, LeNet, AlexNet, and ResNet. The experimental results show that Caffe performs best in communication efficiency and scalability. BigDL is the fastest in computing speed, benefiting from its CPU optimizations, but it suffers from long communication delays due to its dependency on the MapReduce framework. The insights and conclusions from our evaluation provide a significant reference for improving the utilization of supercomputer resources in distributed deep learning.
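The abstract concerns distributed data-parallel training of neural networks on CPU clusters with tools such as TensorFlow. As an illustration only, not code from the paper, the sketch below shows a minimal TensorFlow multi-worker data-parallel setup training a LeNet-style model on CPU nodes; the host names, port, batch size, and dataset are placeholder assumptions.

```python
# Minimal, hypothetical sketch of multi-worker data-parallel training on CPU
# nodes with TensorFlow; not the paper's experimental code. Hosts and port
# are placeholders, and each worker would run this script with its own index.
import json
import os

import tensorflow as tf

# Each worker sets TF_CONFIG with the full cluster spec and its own rank
# before the strategy is created.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["node0:12345", "node1:12345"]},  # placeholder hosts
    "task": {"type": "worker", "index": 0},                 # this worker's rank
})

# Synchronous data-parallel replication; gradients are all-reduced each step.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # A LeNet-style CNN, one of the model families mentioned in the abstract.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(6, 5, activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(16, 5, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(120, activation="relu"),
        tf.keras.layers.Dense(84, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="sgd",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# MNIST stands in for the datasets used in the study.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0

# The global batch is split across workers each step.
model.fit(x_train, y_train, batch_size=256, epochs=1)
```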