We consider the problem of learning a one-hidden-layer neural network with a non-overlapping convolutional layer and ReLU activation, i.e., $f(\mathbf{Z}; w, a) = \sum_j a_j \sigma(w^\top \mathbf{Z}_j)$, in which both the convolutional weights $w$ and the output weights $a$ are parameters to be learned. We prove that with Gaussian input $\mathbf{Z}$ there is a spurious local minimizer. Surprisingly, in the presence of this spurious local minimizer, gradient descent with weight normalization starting from randomly initialized weights can still be proven to recover the true parameters with constant probability (which can be boosted to probability $1$ with multiple restarts). We also show that with constant probability the same procedure can instead converge to the spurious local minimum, so the local minimum plays a non-trivial role in the dynamics of gradient descent. Furthermore, a quantitative analysis shows that the gradient descent dynamics has two phases: it starts off slow, but converges much faster after several iterations.
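The model $f(\mathbf{Z}; w, a) = \sum_j a_j \sigma(w^\top \mathbf{Z}_j)$ can be sketched in a few lines of NumPy; the patch count, patch dimension, and variable names below are illustrative assumptions, not the paper's notation for any specific experiment.

```python
import numpy as np

def forward(Z, w, a):
    """One-hidden-layer CNN with non-overlapping patches.

    Z : (k, p) array, the k non-overlapping patches Z_j of the input.
    w : (p,)  array, the shared convolutional filter.
    a : (k,)  array, the output-layer weights.
    Returns f(Z; w, a) = sum_j a_j * relu(w^T Z_j).
    """
    pre = Z @ w                      # pre-activations w^T Z_j, shape (k,)
    return a @ np.maximum(pre, 0.0)  # weighted sum of ReLU outputs

# Toy instance with Gaussian patches, matching the input model in the abstract.
rng = np.random.default_rng(0)
k, p = 4, 3                          # hypothetical sizes for illustration
Z = rng.standard_normal((k, p))
w_true = rng.standard_normal(p)
a_true = rng.standard_normal(k)
y = forward(Z, w_true, a_true)       # label generated by the true parameters
```

Because the patches are non-overlapping, each $\mathbf{Z}_j$ is an independent slice of the input, and the filter $w$ is shared across all of them.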