Practical Efficiency of Asynchronous Stochastic Gradient Descent


Abstract

Stochastic gradient descent (SGD) and its distributed variants are essential for leveraging modern computing resources in large-scale machine learning tasks. ASGD [1] is one of the most popular asynchronous distributed variants of SGD. Recent mathematical analyses have shown that, under certain assumptions on the learning task (and ignoring communication cost), ASGD exhibits linear speed-up asymptotically. In practice, however, ASGD does not achieve linear speed-up as the number of learners increases. Motivated by this, we investigate the finite-time convergence properties of ASGD. We observe that the learning rate that mathematical analyses use to guarantee linear speed-up can be very small (and practically sub-optimal with respect to convergence speed), whereas the learning rates chosen in practice for quick convergence exhibit sub-linear speed-up. We show that this observation can in fact be supported by mathematical analysis: in the finite-time regime, better convergence rate guarantees can be proven for ASGD with a small number of learners, which indicates a lack of linear speed-up as the number of learners increases. We therefore conclude that, even when communication cost is ignored, ASGD has an inherent inefficiency with respect to increasing the number of learners.
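The interaction between learning rate and number of learners described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' analysis or experimental setup. It assumes a one-dimensional objective f(w) = 0.5 * w^2, Gaussian gradient noise, and a fixed round-robin staleness model in which every applied gradient was computed on the parameter value from (num_learners - 1) updates earlier. The function name `simulate_asgd` and all of its parameters are hypothetical.

```python
import random


def simulate_asgd(num_learners, lr, num_updates, noise_std=0.1, seed=0):
    """Toy ASGD on f(w) = 0.5 * w**2 with a fixed-staleness model.

    Assumption (not from the paper): each gradient applied by the server
    was computed on the parameter from (num_learners - 1) updates earlier.
    Returns |w| after num_updates server updates, starting from w = 1.0.
    """
    rng = random.Random(seed)
    staleness = num_learners - 1
    history = [1.0]  # history[t] is the parameter after t updates
    for t in range(num_updates):
        # The learner read the parameters `staleness` updates ago.
        stale_w = history[max(0, t - staleness)]
        # Noisy gradient of 0.5 * w**2 evaluated at the stale parameter.
        grad = stale_w + rng.gauss(0.0, noise_std)
        history.append(history[-1] - lr * grad)
    return abs(history[-1])


if __name__ == "__main__":
    for learners, lr in [(1, 0.5), (8, 0.5), (8, 0.05)]:
        final = simulate_asgd(learners, lr, num_updates=200)
        print(f"learners={learners} lr={lr} final |w|={final:.3g}")
```

In this toy model, a learning rate that converges quickly with one learner becomes unstable with eight learners, and stability returns only at a much smaller learning rate. This is consistent in spirit with the abstract's point that the learning rates needed for guarantees with many learners can be far smaller than those chosen in practice for a few learners.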

Bibliographic record

Source
2016 2nd Workshop on Machine Learning in HPC Environments | 2016 | pp. 56-62 | 7 pages
Conference location: Salt Lake City (US)
Authors
Onkar Bhardwaj; Guojing Cong
Author affiliations

IBM T. J. Watson Research, Yorktown Heights, NY, USA;

IBM T. J. Watson Research, Yorktown Heights, NY, USA;

Original format: PDF
Language: English
Keywords
Convergence; Mathematical analysis; Servers; Training; Neural networks; Linear programming; Convolution;

Similar literature

1. Distributed and asynchronous Stochastic Gradient Descent with variance reduction [J]. Ming Yuewei, Zhao Yawei, Wu Chengkun. Neurocomputing, 2018, Issue MARa15
2. Asynchronous Decentralized Parallel Stochastic Gradient Descent [J]. Xiangru Lian, Wei Zhang, Ce Zhang. JMLR: Workshop and Conference Proceedings, 2018, Issue 3
3. Practical Efficiency of Asynchronous Stochastic Gradient Descent [C]. Onkar Bhardwaj, Guojing Cong. Workshop on Machine Learning in HPC Environments, 2016
4. An Investigation of Stochastic Gradient Descent Dynamics of Neural Networks [D]. Luo, Victor. 2021
5. Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent [O]. Christopher De Sa, Matthew Feldman, Christopher Ré
6. The Convergence of Stochastic Gradient Descent in Asynchronous Shared Memory [O]. Dan Alistarh, Christopher De Sa, Nikola Konstantinov. 2018
