International Conference on Machine Learning

The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study



Abstract

We investigate how the final parameters found by stochastic gradient descent are influenced by over-parameterization. We generate families of models by increasing the number of channels in a base network, and then perform a large hyper-parameter search to study how the test error depends on learning rate, batch size, and network width. We find that the optimal SGD hyper-parameters are determined by a "normalized noise scale," which is a function of the batch size, learning rate, and initialization conditions. In the absence of batch normalization, the optimal normalized noise scale is directly proportional to width. Wider networks, with their higher optimal noise scale, also achieve higher test accuracy. These observations hold for MLPs, ConvNets, and ResNets, and for two different parameterization schemes ("Standard" and "NTK"). We observe a similar trend with batch normalization for ResNets. Surprisingly, since the largest stable learning rate is bounded, the largest batch size consistent with the optimal normalized noise scale decreases as the width increases.
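As a rough illustration of the experimental setup described in the abstract, the following is a minimal sketch (not the authors' code): it builds a family of MLPs whose hidden width is scaled by a multiplier, grid-searches learning rate and batch size for each width with plain SGD, and reports the best test accuracy alongside the simple SGD noise-scale proxy g ≈ εN/B. The synthetic data, the helper names make_mlp and run_trial, and the grid values are assumptions for illustration only; the paper's "normalized noise scale" additionally accounts for the initialization and parameterization scheme ("Standard" vs "NTK"), which is not modeled here.

```python
import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Synthetic stand-in for a real dataset (assumption: the paper uses image benchmarks).
N_TRAIN, N_TEST, DIM, CLASSES = 2048, 512, 32, 10
Xtr, ytr = torch.randn(N_TRAIN, DIM), torch.randint(0, CLASSES, (N_TRAIN,))
Xte, yte = torch.randn(N_TEST, DIM), torch.randint(0, CLASSES, (N_TEST,))

def make_mlp(width_mult, base_width=64):
    """Two-hidden-layer ReLU MLP; width_mult scales the number of hidden units.
    PyTorch's default init is used, roughly corresponding to a "Standard" parameterization."""
    w = int(base_width * width_mult)
    return nn.Sequential(
        nn.Linear(DIM, w), nn.ReLU(),
        nn.Linear(w, w), nn.ReLU(),
        nn.Linear(w, CLASSES),
    )

def run_trial(width_mult, lr, batch_size, epochs=3):
    """Train one model with plain SGD and return its test accuracy."""
    model = make_mlp(width_mult)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        perm = torch.randperm(N_TRAIN)
        for i in range(0, N_TRAIN, batch_size):
            idx = perm[i:i + batch_size]
            loss = F.cross_entropy(model(Xtr[idx]), ytr[idx])
            opt.zero_grad()
            loss.backward()
            opt.step()
    with torch.no_grad():
        return (model(Xte).argmax(1) == yte).float().mean().item()

# Grid search: for each width, sweep (learning rate, batch size) and keep the best setting.
for width_mult in [1, 2, 4]:
    acc, lr, bs = max(
        ((run_trial(width_mult, lr, bs), lr, bs)
         for lr, bs in itertools.product([0.01, 0.1, 1.0], [32, 128, 512])),
        key=lambda t: t[0],
    )
    noise_scale = lr * N_TRAIN / bs  # simple SGD noise-scale proxy g ~ eps * N / B
    print(f"width x{width_mult}: best acc={acc:.3f} at lr={lr}, B={bs}, g~{noise_scale:.1f}")
```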
