JMLR: Workshop and Conference Proceedings

Understanding Generalization and Optimization Performance of Deep CNNs


Abstract

This work aims to provide an understanding of the remarkable success of deep convolutional neural networks (CNNs) by theoretically analyzing their generalization performance and establishing optimization guarantees for gradient-descent-based training algorithms. Specifically, for a CNN model consisting of $l$ convolutional layers and one fully connected layer, we prove that its generalization error is bounded by $\mathcal{O}(\sqrt{\theta\widetilde{\varrho}/n})$, where $n$ is the number of training samples, $\theta$ denotes the freedom degree of the network parameters, and $\widetilde{\varrho}=\mathcal{O}\big(\log\big(\prod_{i=1}^{l}b_{i}(k_{i}-s_{i}+1)/p\big)+\log(b_{l+1})\big)$ encapsulates architecture parameters, including the kernel size $k_{i}$, stride $s_{i}$, pooling size $p$, and parameter magnitude $b_{i}$. To the best of our knowledge, this is the first generalization bound that depends only on $\mathcal{O}(\log(\prod_{i=1}^{l+1}b_{i}))$, tighter than existing ones that all involve an exponential term such as $\mathcal{O}(\prod_{i=1}^{l+1}b_{i})$. Moreover, we prove that for an arbitrary gradient descent algorithm, the approximate stationary point computed by minimizing the empirical risk is also an approximate stationary point of the population risk. This explains well why gradient descent training algorithms usually perform sufficiently well in practice. Furthermore, we prove one-to-one correspondence and convergence guarantees for the non-degenerate stationary points between the empirical and population risks. This implies that a computed local minimum of the empirical risk is also close to a local minimum of the population risk, thus ensuring that the optimized CNN model generalizes well to new data.
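To make the scaling of the bound concrete, the following is a minimal numerical sketch in Python. All values used here (kernel sizes, strides, pooling size, parameter magnitudes, the freedom degree $\theta$, and the sample count $n$) are hypothetical placeholders rather than figures from the paper; the snippet only evaluates the stated $\mathcal{O}(\sqrt{\theta\widetilde{\varrho}/n})$ expression and contrasts the logarithmic term $\widetilde{\varrho}$ with the exponential product $\prod_{i}b_{i}$ that appears in earlier bounds.

```python
import math

# Hypothetical 3-layer CNN architecture, used only for illustration;
# none of these numbers come from the paper itself.
kernel_sizes = [5, 5, 3]             # k_i
strides      = [1, 1, 1]             # s_i
pool_size    = 2                     # p
magnitudes   = [2.0, 2.0, 2.0, 2.0]  # b_1..b_l and b_{l+1} (fully connected layer)

# Architecture-dependent factor from the stated bound:
# rho_tilde = O( log( prod_i b_i * (k_i - s_i + 1) / p ) + log(b_{l+1}) )
prod = 1.0
for k, s, b in zip(kernel_sizes, strides, magnitudes[:-1]):
    prod *= b * (k - s + 1) / pool_size
rho_tilde = math.log(prod) + math.log(magnitudes[-1])

# Contrast with the exponential dependence prod_i b_i of earlier bounds:
# it grows multiplicatively with depth, while rho_tilde grows only additively.
exp_term = math.prod(magnitudes)

theta = 1e5   # hypothetical freedom degree of the network parameters
n = 1e6       # hypothetical number of training samples
bound = math.sqrt(theta * rho_tilde / n)   # O(sqrt(theta * rho_tilde / n))

print(f"rho_tilde (logarithmic term) = {rho_tilde:.3f}")
print(f"prod_i b_i (exponential term) = {exp_term:.3f}")
print(f"generalization bound = O({bound:.3f})")
```

Because $\widetilde{\varrho}$ enters only logarithmically, doubling every $b_{i}$ adds a constant to $\widetilde{\varrho}$, whereas it multiplies the exponential term $\prod_{i=1}^{l+1}b_{i}$ by $2^{l+1}$.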
