Understanding the Disharmony between Weight Normalization Family and Weight Decay

Abstract

The merits of fast convergence and potentially better performance of the weight normalization family have drawn increasing attention in recent years. These methods use standardization or normalization that changes the weight W to W', which makes W' independent of the magnitude of W. Surprisingly, W must be decayed during gradient descent, otherwise we observe a severe under-fitting problem, which is very counter-intuitive since weight decay is widely known to prevent deep networks from over-fitting. Moreover, if we substitute (e.g., for weight normalization) W' = W/||W|| into the original loss function Σ_i L(f(x_i; W'), y_i) + (λ/2)||W'||², the regularization term (λ/2)||W'||² is canceled as the constant λ/2 in the optimization objective. Therefore, to decay W, we need to explicitly append the term (λ/2)||W||². In this paper, we theoretically prove that (λ/2)||W||² improves optimization only by modulating the effective learning rate, and has virtually no influence on generalization when it is composed with the weight normalization family. Furthermore, we expose several serious problems that arise when the weight decay term is introduced to the weight normalization family, including the absence of a global minimum, training instability, and sensitivity to initialization. To address these problems, we propose an Adaptive Weight Shrink (AWS) scheme, which gradually shrinks the weights during optimization by a dynamic coefficient proportional to the magnitude of the parameter. This simple yet effective method appropriately controls the effective learning rate, significantly improves training stability, and makes optimization more robust to initialization.
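
The scale-invariance argument in the abstract can be checked numerically. The following minimal PyTorch sketch (toy data and a single linear unit, not taken from the paper) evaluates the objective Σ_i L(f(x_i; W'), y_i) + (λ/2)||W'||² at two magnitudes of W sharing one direction: the loss, including the regularizer, is unchanged, while the gradient norm scales as 1/||W||, illustrating why ||W|| acts only through the effective learning rate.

```python
import torch

torch.manual_seed(0)
x = torch.randn(8)          # toy input (illustrative, not from the paper)
y = torch.tensor(1.0)       # toy target
lam = 1e-2                  # weight decay coefficient λ
direction = torch.randn(8)  # fixed direction for W

def objective(W):
    W_prime = W / W.norm()                 # weight normalization: W' = W/||W||
    fit = (W_prime @ x - y) ** 2           # L(f(x; W'), y)
    reg = 0.5 * lam * W_prime.norm() ** 2  # (λ/2)||W'||² — always the constant λ/2
    return fit + reg

for scale in (1.0, 10.0):
    W = (scale * direction).clone().requires_grad_()
    loss = objective(W)
    loss.backward()
    print(f"||W|| = {W.norm().item():5.2f}  loss = {loss.item():.6f}  "
          f"||dL/dW|| = {W.grad.norm().item():.6f}")

# Both magnitudes give the identical loss (the regularizer cancels to λ/2),
# but the gradient norm is 10x smaller at 10x the magnitude: decaying W can
# therefore affect optimization only via the effective learning rate.
```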
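
The abstract describes AWS only at a high level: weights are gradually shrunk by a dynamic coefficient proportional to the parameter's magnitude. The update rule below is an assumed illustrative form (in particular, the coefficient shrink = eta * ||W|| is a hypothetical choice, not the paper's formula); the full text should be consulted for the actual scheme.

```python
import torch

def aws_step(W: torch.Tensor, grad: torch.Tensor,
             lr: float = 0.1, eta: float = 1e-3) -> torch.Tensor:
    """One hypothetical Adaptive Weight Shrink (AWS) update — a sketch only.

    Per the abstract, the weights are shrunk by a dynamic coefficient
    proportional to the parameter magnitude; the specific choice
    shrink = eta * ||W|| here is an assumption, not the paper's formula.
    """
    with torch.no_grad():
        W -= lr * grad                            # plain gradient descent step
        shrink = eta * W.norm()                   # dynamic coefficient ∝ ||W|| (assumed)
        W *= torch.clamp(1.0 - shrink, min=0.0)   # gradually shrink the weights
    return W
```

Tying the shrink strength to ||W|| keeps the weight magnitude, and hence the effective learning rate, within a controlled range, which is the stability property the abstract attributes to AWS.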
