International Conference on Machine Learning

The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study



Abstract

We investigate how the final parameters found by stochastic gradient descent are influenced by over-parameterization. We generate families of models by increasing the number of channels in a base network, and then perform a large hyper-parameter search to study how the test error depends on learning rate, batch size, and network width. We find that the optimal SGD hyper-parameters are determined by a "normalized noise scale," which is a function of the batch size, learning rate, and initialization conditions. In the absence of batch normalization, the optimal normalized noise scale is directly proportional to width. Wider networks, with their higher optimal noise scale, also achieve higher test accuracy. These observations hold for MLPs, ConvNets, and ResNets, and for two different parameterization schemes ("Standard" and "NTK"). We observe a similar trend with batch normalization for ResNets. Surprisingly, since the largest stable learning rate is bounded, the largest batch size consistent with the optimal normalized noise scale decreases as the width increases.
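As a rough illustration of the experimental setup described in the abstract, the following is a minimal sketch (not the authors' code): it builds a family of MLPs whose hidden width is scaled by a multiplier, grid-searches learning rate and batch size for each width with plain SGD, and reports the best test accuracy alongside the simple SGD noise-scale proxy g ≈ εN/B. The synthetic data, the helper names make_mlp and run_trial, and the grid values are assumptions for illustration only; the paper's "normalized noise scale" additionally accounts for the initialization and parameterization scheme ("Standard" vs "NTK"), which is not modeled here.

```python
import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Synthetic stand-in for a real dataset (assumption: the paper uses image benchmarks).
N_TRAIN, N_TEST, DIM, CLASSES = 2048, 512, 32, 10
Xtr, ytr = torch.randn(N_TRAIN, DIM), torch.randint(0, CLASSES, (N_TRAIN,))
Xte, yte = torch.randn(N_TEST, DIM), torch.randint(0, CLASSES, (N_TEST,))

def make_mlp(width_mult, base_width=64):
    """Two-hidden-layer ReLU MLP; width_mult scales the number of hidden units.
    PyTorch's default init is used, roughly corresponding to a "Standard" parameterization."""
    w = int(base_width * width_mult)
    return nn.Sequential(
        nn.Linear(DIM, w), nn.ReLU(),
        nn.Linear(w, w), nn.ReLU(),
        nn.Linear(w, CLASSES),
    )

def run_trial(width_mult, lr, batch_size, epochs=3):
    """Train one model with plain SGD and return its test accuracy."""
    model = make_mlp(width_mult)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        perm = torch.randperm(N_TRAIN)
        for i in range(0, N_TRAIN, batch_size):
            idx = perm[i:i + batch_size]
            loss = F.cross_entropy(model(Xtr[idx]), ytr[idx])
            opt.zero_grad()
            loss.backward()
            opt.step()
    with torch.no_grad():
        return (model(Xte).argmax(1) == yte).float().mean().item()

# Grid search: for each width, sweep (learning rate, batch size) and keep the best setting.
for width_mult in [1, 2, 4]:
    acc, lr, bs = max(
        ((run_trial(width_mult, lr, bs), lr, bs)
         for lr, bs in itertools.product([0.01, 0.1, 1.0], [32, 128, 512])),
        key=lambda t: t[0],
    )
    noise_scale = lr * N_TRAIN / bs  # simple SGD noise-scale proxy g ~ eps * N / B
    print(f"width x{width_mult}: best acc={acc:.3f} at lr={lr}, B={bs}, g~{noise_scale:.1f}")
```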
