JMLR: Workshop and Conference Proceedings

Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent



Abstract

Nesterov’s accelerated gradient descent (AGD), an instance of the general family of “momentum methods,” provably achieves a faster convergence rate than gradient descent (GD) in the convex setting. While these methods are widely used in modern \emph{nonconvex} applications, including the training of deep neural networks, whether they are provably superior to GD in the nonconvex setting remains open. This paper studies a simple variant of Nesterov’s AGD and shows that it escapes saddle points and finds a second-order stationary point in $\tilde{O}(1/\epsilon^{7/4})$ iterations, matching the best known convergence rate, which is faster than the $\tilde{O}(1/\epsilon^{2})$ iterations required by GD. To the best of our knowledge, this is the first direct acceleration (single-loop) algorithm that is provably faster than GD in the general nonconvex setting; all previous nonconvex accelerated algorithms rely on more complex mechanisms such as nested loops and proximal terms. Our analysis is based on two key ideas: (1) the use of a simple Hamiltonian function, inspired by a continuous-time perspective, which AGD monotonically decreases on each step even for nonconvex functions, and (2) a novel framework called \emph{improve or localize}, which is useful for tracking the long-term behavior of gradient-based optimization algorithms. We believe that these techniques may deepen our understanding of both acceleration algorithms and nonconvex optimization.
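For concreteness, below is a minimal Python sketch of the momentum iteration together with the Hamiltonian E_t = f(x_t) + ||v_t||^2 / (2*eta) that the analysis tracks. The function name agd_sketch, the hyperparameters, and the quartic test function are illustrative assumptions, not the authors' exact algorithm; in particular, the perturbation and negative-curvature-exploitation steps of the paper's full single-loop variant are omitted here.

import numpy as np

def agd_sketch(f, grad, x0, eta=1e-3, theta=0.1, num_iters=1000):
    # Sketch of the Nesterov-style momentum iteration:
    #   y_t     = x_t + (1 - theta) * v_t
    #   x_{t+1} = y_t - eta * grad(y_t)
    #   v_{t+1} = x_{t+1} - x_t
    # Hyperparameters (eta, theta) are illustrative, not the paper's choices.
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    hamiltonians = []
    for _ in range(num_iters):
        y = x + (1.0 - theta) * v        # momentum (look-ahead) point
        x_next = y - eta * grad(y)       # gradient step at the look-ahead point
        v = x_next - x                   # updated velocity
        x = x_next
        # Record the Hamiltonian f(x) + ||v||^2 / (2*eta) so its evolution
        # along the trajectory can be inspected.
        hamiltonians.append(f(x) + v.dot(v) / (2.0 * eta))
    return x, hamiltonians

# Example: a simple nonconvex quartic with a saddle point at the origin
# (hypothetical test function) and minimizers near (+1, 0) and (-1, 0).
f = lambda z: 0.25 * z[0]**4 - 0.5 * z[0]**2 + 0.5 * z[1]**2
grad = lambda z: np.array([z[0]**3 - z[0], z[1]])
x_final, E = agd_sketch(f, grad, np.array([1e-3, 1.0]))

Started slightly off the saddle at the origin, the iterate should drift toward one of the minimizers, and the recorded Hamiltonian values give a simple way to observe the decrease behavior the abstract refers to.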

