
Convergence analysis of distributed stochastic gradient descent with shuffling



Abstract

When using stochastic gradient descent (SGD) to solve large-scale machine learning problems, especially deep learning problems, a common data-processing practice is to shuffle the training data, partition the data across multiple threads/machines if needed, and then perform several epochs of training on the reshuffled (either locally or globally) data. This procedure means that the instances used to compute the gradients are no longer independently sampled from the training data set, which contradicts the basic assumptions of the conventional convergence analysis of SGD. Does the distributed SGD method still have desirable convergence properties in this practical situation? In this paper, we answer this question. First, we give a mathematical formulation of the practical data-processing procedure in distributed machine learning, which we call (data partition with) global/local shuffling. We observe that global shuffling is equivalent to without-replacement sampling if the shuffling operations are independent. Second, we prove that SGD with global shuffling and with local shuffling has convergence guarantees for non-convex tasks such as deep learning. The convergence rate for local shuffling is slower than that for global shuffling, since some information is lost when there is no communication between the partitioned data. We also consider the situation in which the permutation after shuffling is not uniformly distributed (we call this insufficient shuffling), and discuss the condition under which this insufficiency does not affect the convergence rate. Finally, we give the convergence analysis in the convex case. An interesting finding is that non-convex tasks such as deep learning are more suitable for shuffling than convex tasks. Our theoretical results provide important insights for large-scale machine learning, especially for selecting data-processing methods that achieve faster convergence and good speedup. Our theoretical findings are verified by extensive experiments on logistic regression and deep neural networks. (c) 2019 Elsevier B.V. All rights reserved.
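To make the data-processing setup described above concrete, here is a minimal sketch (not taken from the paper) contrasting global and local shuffling for a distributed setting; the NumPy-based helpers and the index-based partition scheme are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def global_shuffle_partition(num_samples, num_workers, rng):
    """Globally shuffle all sample indices, then split them across workers.

    With independent permutations across epochs, each worker processes a
    fresh slice of one global permutation, which matches the
    without-replacement-sampling view of global shuffling.
    """
    perm = rng.permutation(num_samples)
    return np.array_split(perm, num_workers)

def local_shuffle_partition(worker_indices, rng):
    """Each worker reshuffles only its own fixed local partition.

    No indices move between workers, so the partitions never mix; this is
    the local-shuffling setting, whose convergence rate is slower.
    """
    return [rng.permutation(idx) for idx in worker_indices]

# Illustrative usage: 8 samples, 2 workers, one epoch of each scheme.
rng = np.random.default_rng(0)
parts_global = global_shuffle_partition(8, 2, rng)
parts_local = local_shuffle_partition(parts_global, rng)
print(parts_global)
print(parts_local)
```

In each epoch, every worker would iterate over its assigned index list and compute stochastic gradients in that order; the only difference between the two schemes is whether a single global permutation is re-split across workers or each worker permutes its fixed local shard.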

Bibliographic record

  • Source
    Neurocomputing | 2019, Issue 14 | pp. 46-57 | 12 pages
  • Author affiliations

    Peking Univ, 5 Yiheyuan Rd Haidian Dist, Beijing, Peoples R China;

    Microsoft Res Asia, Machine Learning Grp, 5 Dan Ling St Haidian Dist, Beijing, Peoples R China;

    Beijing Jiaotong Univ, 3 Shangyuancun Haidian Dist, Beijing, Peoples R China;

    Chinese Acad Math & Syst Sci, 55 Zhongguancun East Rd Haidian Dist, Beijing, Peoples R China;

    Microsoft Res Asia, Machine Learning Grp, 5 Dan Ling St Haidian Dist, Beijing, Peoples R China;

  • Indexed in: Science Citation Index (SCI); Engineering Index (EI)
  • Original format: PDF
  • Language: eng
  • CLC classification:
  • Keywords

    Deep learning; Stochastic gradient descent; Distributed computing; Non-convex optimization; Shuffling;

