International Conference on Machine Learning

Understanding Generalization and Optimization Performance of Deep CNNs



Abstract

This work aims to provide an understanding of the remarkable success of deep convolutional neural networks (CNNs) by theoretically analyzing their generalization performance and establishing optimization guarantees for gradient-descent-based training algorithms. Specifically, for a CNN model consisting of l convolutional layers and one fully connected layer, we prove that its generalization error is bounded by O(√(θq/n)), where θ denotes the degree of freedom of the network parameters and q = O(log(∏_{i=1}^{l} b_i(k_i − s_i + 1)/p) + log(b_{l+1})) encapsulates architecture parameters, including the kernel size k_i, stride s_i, pooling size p, and parameter magnitude b_i. To the best of our knowledge, this is the first generalization bound that depends only on O(log(∏_{i=1}^{l+1} b_i)), tighter than existing ones that all involve an exponential term like O(∏_{i=1}^{l+1} b_i). Besides, we prove that for an arbitrary gradient descent algorithm, the approximate stationary point computed by minimizing the empirical risk is also an approximate stationary point of the population risk. This explains well why gradient descent training algorithms usually perform sufficiently well in practice. Furthermore, we prove one-to-one correspondence and convergence guarantees for the non-degenerate stationary points between the empirical and population risks. This implies that the computed local minimum of the empirical risk is also close to a local minimum of the population risk, thus ensuring the good generalization performance of CNNs.
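To make the dependence on the architecture parameters concrete, below is a minimal Python sketch that plugs illustrative values into the dominant term of the stated bound. All numbers are assumptions chosen for illustration (not taken from the paper), and constants hidden by the O(·) notation are ignored.

```python
import math

# Illustrative toy architecture; values are assumptions, not from the paper.
n = 50_000                 # number of training samples
l = 3                      # number of convolutional layers
k = [5, 5, 3]              # kernel sizes k_i
s = [1, 1, 1]              # strides s_i
p = 2                      # pooling size
b = [1.5, 1.5, 1.5, 1.5]   # parameter-magnitude bounds b_1, ..., b_{l+1}
theta = 1.2e5              # degree of freedom of the network parameters

# q = O( log( prod_{i=1}^{l} b_i (k_i - s_i + 1) / p ) + log(b_{l+1}) )
prod_term = 1.0
for k_i, s_i, b_i in zip(k, s, b[:l]):
    prod_term *= b_i * (k_i - s_i + 1) / p
q = math.log(prod_term) + math.log(b[l])

# Generalization error scales as O( sqrt(theta * q / n) ).
bound = math.sqrt(theta * q / n)
print(f"q ~ {q:.3f}, generalization error scale ~ {bound:.3f}")
```

The sketch only illustrates the scaling: the bound grows logarithmically in the per-layer factors b_i(k_i − s_i + 1)/p and with the square root of θ/n, rather than exponentially in the magnitudes b_i as in earlier bounds.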
