
Convergence analysis of distributed stochastic gradient descent with shuffling



Abstract

When using stochastic gradient descent (SGD) to solve large-scale machine learning problems, especially deep learning problems, a common data-processing practice is to shuffle the training data, partition the data across multiple threads/machines if needed, and then perform several epochs of training on the reshuffled (either locally or globally) data. This procedure means that the instances used to compute the gradients are no longer independently sampled from the training data set, which contradicts the basic assumptions of the conventional convergence analysis of SGD. Does the distributed SGD method still have desirable convergence properties in this practical situation? In this paper, we answer this question. First, we give a mathematical formulation of the practical data-processing procedure in distributed machine learning, which we call (data partition with) global/local shuffling. We observe that global shuffling is equivalent to without-replacement sampling if the shuffling operations are independent. Second, we prove that SGD with global shuffling and with local shuffling has convergence guarantees for non-convex tasks such as deep learning. The convergence rate for local shuffling is slower than that for global shuffling, since some information is lost when there is no communication between the partitioned data. We also consider the situation in which the permutation after shuffling is not uniformly distributed (we call this insufficient shuffling), and discuss the condition under which this insufficiency does not affect the convergence rate. Finally, we give the convergence analysis in the convex case. An interesting finding is that non-convex tasks such as deep learning are more suitable for shuffling than convex tasks. Our theoretical results provide important insights for large-scale machine learning, especially for selecting data-processing methods that achieve faster convergence and good speedup. Our theoretical findings are verified by extensive experiments on logistic regression and deep neural networks. (c) 2019 Elsevier B.V. All rights reserved.
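To make the data-processing setup described above concrete, here is a minimal sketch (not taken from the paper) contrasting global and local shuffling for a distributed setting; the NumPy-based helpers and the index-based partition scheme are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def global_shuffle_partition(num_samples, num_workers, rng):
    """Globally shuffle all sample indices, then split them across workers.

    With independent permutations across epochs, each worker processes a
    fresh slice of one global permutation, which matches the
    without-replacement-sampling view of global shuffling.
    """
    perm = rng.permutation(num_samples)
    return np.array_split(perm, num_workers)

def local_shuffle_partition(worker_indices, rng):
    """Each worker reshuffles only its own fixed local partition.

    No indices move between workers, so the partitions never mix; this is
    the local-shuffling setting, whose convergence rate is slower.
    """
    return [rng.permutation(idx) for idx in worker_indices]

# Illustrative usage: 8 samples, 2 workers, one epoch of each scheme.
rng = np.random.default_rng(0)
parts_global = global_shuffle_partition(8, 2, rng)
parts_local = local_shuffle_partition(parts_global, rng)
print(parts_global)
print(parts_local)
```

In each epoch, every worker would iterate over its assigned index list and compute stochastic gradients in that order; the only difference between the two schemes is whether a single global permutation is re-split across workers or each worker permutes its fixed local shard.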

Bibliographic record

  • Source
    Neurocomputing | 2019, Issue 14 | pp. 46-57 | 12 pages
  • Author affiliations

    Peking Univ, 5 Yiheyuan Rd Haidian Dist, Beijing, Peoples R China;

    Microsoft Res Asia, Machine Learning Grp, 5 Dan Ling St Haidian Dist, Beijing, Peoples R China;

    Beijing Jiaotong Univ, 3 Shangyuancun Haidian Dist, Beijing, Peoples R China;

    Chinese Acad Math & Syst Sci, 55 Zhongguancun East Rd Haidian Dist, Beijing, Peoples R China;

    Microsoft Res Asia, Machine Learning Grp, 5 Dan Ling St Haidian Dist, Beijing, Peoples R China;

  • Indexed in: Science Citation Index (SCI); Engineering Index (EI)
  • Original format: PDF
  • Language: eng
  • CLC classification:
  • Keywords

    Deep learning; Stochastic gradient descent; Distributed computing; Non-convex optimization; Shuffling;

