
Convergence analysis of distributed stochastic gradient descent with shuffling



Abstract

When using stochastic gradient descent (SGD) to solve large-scale machine learning problems, especially deep learning problems, a common data-processing practice is to shuffle the training data, partition it across multiple threads/machines if needed, and then perform several epochs of training on the (locally or globally) reshuffled data. This procedure means that the instances used to compute the gradients are no longer independently sampled from the training data set, which contradicts the basic assumption underlying the conventional convergence analysis of SGD. Does the distributed SGD method still have desirable convergence properties in this practical situation? In this paper, we answer this question. First, we give a mathematical formulation of the practical data-processing procedure in distributed machine learning, which we call (data partition with) global/local shuffling. We observe that global shuffling is equivalent to without-replacement sampling if the shuffling operations are independent. Second, we prove that SGD with global shuffling and with local shuffling has convergence guarantees for non-convex tasks such as deep learning. The convergence rate for local shuffling is slower than that for global shuffling, since local shuffling loses some information when there is no communication between the partitioned data. We also consider the situation in which the permutation after shuffling is not uniformly distributed (we call this insufficient shuffling), and discuss the condition under which this insufficiency does not affect the convergence rate. Finally, we give the convergence analysis in the convex case. An interesting finding is that non-convex tasks such as deep learning are more suitable for shuffling than convex tasks. Our theoretical results provide important insights into large-scale machine learning, especially into the selection of data-processing methods for achieving faster convergence and good speedup. Our theoretical findings are verified by extensive experiments on logistic regression and deep neural networks. (c) 2019 Elsevier B.V. All rights reserved.
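The two data-processing schemes described in the abstract can be illustrated with a short sketch. The following Python snippet is not the authors' code; the quadratic objective, worker count, step size, and simple model averaging are illustrative assumptions. It contrasts global shuffling, which reshuffles the full training set and repartitions it across workers every epoch, with local shuffling, which fixes the partition once and lets each worker reshuffle only its own block.

```python
# Minimal sketch (assumed setup, not the paper's experiments) contrasting
# global shuffling and local shuffling for distributed SGD.
import numpy as np

def sgd_epoch(w, X, y, lr=0.01):
    """One pass of plain SGD over the samples in the given order."""
    for xi, yi in zip(X, y):
        grad = (xi @ w - yi) * xi          # gradient of 0.5 * (x·w - y)^2
        w = w - lr * grad
    return w

def global_shuffling_epoch(w, X, y, num_workers, rng, lr=0.01):
    # Reshuffle the whole data set, then repartition it across workers.
    perm = rng.permutation(len(X))
    parts = np.array_split(perm, num_workers)
    local_ws = [sgd_epoch(w.copy(), X[p], y[p], lr) for p in parts]
    return np.mean(local_ws, axis=0)        # simple model averaging (illustrative)

def local_shuffling_epoch(w, X_parts, y_parts, rng, lr=0.01):
    # Each worker permutes only its fixed local partition; no data exchange.
    local_ws = []
    for Xp, yp in zip(X_parts, y_parts):
        perm = rng.permutation(len(Xp))
        local_ws.append(sgd_epoch(w.copy(), Xp[perm], yp[perm], lr))
    return np.mean(local_ws, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, workers = 1000, 10, 4
    X = rng.standard_normal((n, d))
    w_true = rng.standard_normal(d)
    y = X @ w_true
    # Fixed partition used by the local-shuffling scheme.
    idx_parts = np.array_split(rng.permutation(n), workers)
    X_parts, y_parts = [X[p] for p in idx_parts], [y[p] for p in idx_parts]

    w_g, w_l = np.zeros(d), np.zeros(d)
    for epoch in range(5):
        w_g = global_shuffling_epoch(w_g, X, y, workers, rng)
        w_l = local_shuffling_epoch(w_l, X_parts, y_parts, rng)
    print("global-shuffling error:", np.linalg.norm(w_g - w_true))
    print("local-shuffling error: ", np.linalg.norm(w_l - w_true))
```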

Bibliographic details

  • Source
    Neurocomputing | 2019, Issue 14 | pp. 46-57 | 12 pages
  • Author affiliations

    Peking Univ, 5 Yiheyuan Rd, Haidian Dist, Beijing, Peoples R China;

    Microsoft Res Asia, Machine Learning Grp, 5 Dan Ling St, Haidian Dist, Beijing, Peoples R China;

    Beijing Jiaotong Univ, 3 Shangyuancun, Haidian Dist, Beijing, Peoples R China;

    Chinese Acad Math & Syst Sci, 55 Zhongguancun East Rd, Haidian Dist, Beijing, Peoples R China;

    Microsoft Res Asia, Machine Learning Grp, 5 Dan Ling St, Haidian Dist, Beijing, Peoples R China;

  • Indexed in: Science Citation Index (SCI), USA; Engineering Index (EI), USA
  • Format: PDF
  • Language: English (eng)
  • CLC classification:
  • Keywords

    Deep learning; Stochastic gradient descent; Distributed computing; Non-convex optimization; Shuffling;

