
Convergence analysis of distributed stochastic gradient descent with shuffling



Abstract

When using stochastic gradient descent (SGD) to solve large-scale machine learning problems, especially deep learning problems, a common data-processing practice is to shuffle the training data, partition it across multiple threads/machines if needed, and then perform several epochs of training on the (locally or globally) reshuffled data. This procedure means that the instances used to compute the gradients are no longer independently sampled from the training data set, which contradicts the basic assumption underlying the conventional convergence analysis of SGD. Does the distributed SGD method still have desirable convergence properties in this practical situation? In this paper, we answer this question. First, we give a mathematical formulation of the practical data-processing procedure in distributed machine learning, which we call (data partition with) global/local shuffling. We observe that global shuffling is equivalent to without-replacement sampling if the shuffling operations are independent. Second, we prove that SGD with global shuffling and with local shuffling has convergence guarantees for non-convex tasks such as deep learning. The convergence rate for local shuffling is slower than that for global shuffling, since local shuffling loses some information when there is no communication between the partitioned data. We also consider the situation in which the permutation after shuffling is not uniformly distributed (we call this insufficient shuffling), and discuss the condition under which this insufficiency does not affect the convergence rate. Finally, we give the convergence analysis in the convex case. An interesting finding is that non-convex tasks such as deep learning are more suitable for shuffling than convex tasks. Our theoretical results provide important insights into large-scale machine learning, especially into the selection of data-processing methods for achieving faster convergence and good speedup. Our theoretical findings are verified by extensive experiments on logistic regression and deep neural networks. (c) 2019 Elsevier B.V. All rights reserved.
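The two data-processing schemes described in the abstract can be illustrated with a short sketch. The following Python snippet is not the authors' code; the quadratic objective, worker count, step size, and simple model averaging are illustrative assumptions. It contrasts global shuffling, which reshuffles the full training set and repartitions it across workers every epoch, with local shuffling, which fixes the partition once and lets each worker reshuffle only its own block.

```python
# Minimal sketch (assumed setup, not the paper's experiments) contrasting
# global shuffling and local shuffling for distributed SGD.
import numpy as np

def sgd_epoch(w, X, y, lr=0.01):
    """One pass of plain SGD over the samples in the given order."""
    for xi, yi in zip(X, y):
        grad = (xi @ w - yi) * xi          # gradient of 0.5 * (x·w - y)^2
        w = w - lr * grad
    return w

def global_shuffling_epoch(w, X, y, num_workers, rng, lr=0.01):
    # Reshuffle the whole data set, then repartition it across workers.
    perm = rng.permutation(len(X))
    parts = np.array_split(perm, num_workers)
    local_ws = [sgd_epoch(w.copy(), X[p], y[p], lr) for p in parts]
    return np.mean(local_ws, axis=0)        # simple model averaging (illustrative)

def local_shuffling_epoch(w, X_parts, y_parts, rng, lr=0.01):
    # Each worker permutes only its fixed local partition; no data exchange.
    local_ws = []
    for Xp, yp in zip(X_parts, y_parts):
        perm = rng.permutation(len(Xp))
        local_ws.append(sgd_epoch(w.copy(), Xp[perm], yp[perm], lr))
    return np.mean(local_ws, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, workers = 1000, 10, 4
    X = rng.standard_normal((n, d))
    w_true = rng.standard_normal(d)
    y = X @ w_true
    # Fixed partition used by the local-shuffling scheme.
    idx_parts = np.array_split(rng.permutation(n), workers)
    X_parts, y_parts = [X[p] for p in idx_parts], [y[p] for p in idx_parts]

    w_g, w_l = np.zeros(d), np.zeros(d)
    for epoch in range(5):
        w_g = global_shuffling_epoch(w_g, X, y, workers, rng)
        w_l = local_shuffling_epoch(w_l, X_parts, y_parts, rng)
    print("global-shuffling error:", np.linalg.norm(w_g - w_true))
    print("local-shuffling error: ", np.linalg.norm(w_l - w_true))
```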

Bibliographic details

  • Source
    Neurocomputing | 2019, Issue 14 | pp. 46-57 | 12 pages
  • Author affiliations

    Peking Univ, 5 Yiheyuan Rd, Haidian Dist, Beijing, Peoples R China;

    Microsoft Res Asia, Machine Learning Grp, 5 Dan Ling St, Haidian Dist, Beijing, Peoples R China;

    Beijing Jiaotong Univ, 3 Shangyuancun, Haidian Dist, Beijing, Peoples R China;

    Chinese Acad Math & Syst Sci, 55 Zhongguancun East Rd, Haidian Dist, Beijing, Peoples R China;

    Microsoft Res Asia, Machine Learning Grp, 5 Dan Ling St, Haidian Dist, Beijing, Peoples R China;

  • Indexed in: Science Citation Index (SCI), USA; Engineering Index (EI), USA
  • Format: PDF
  • Language: English (eng)
  • CLC classification:
  • Keywords

    Deep learning; Stochastic gradient descent; Distributed computing; Non-convex optimization; Shuffling;

