We consider the dynamics of gradient descent for learning a two-layer neural network. We assume the input $x\in\mathbb{R}^d$ is drawn from a Gaussian distribution and the label of $x$ satisfies $f^{\star}(x) = a^{\top}|W^{\star}x|$, where $a\in\mathbb{R}^d$ is a nonnegative vector and $W^{\star}\in\mathbb{R}^{d\times d}$ is an orthonormal matrix. We show that an \emph{over-parameterized} two-layer neural network with ReLU activation, trained by gradient descent from \emph{random initialization}, can provably learn the ground-truth network with population loss at most $o(1/d)$ in polynomial time with polynomially many samples. On the other hand, we prove that any kernel method, including the Neural Tangent Kernel, with a number of samples polynomial in $d$, has population loss at least $\Omega(1/d)$.
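The setting above can be sketched in a few lines of NumPy: sample Gaussian inputs, generate labels from the ground truth $f^{\star}(x) = a^{\top}|W^{\star}x|$ (with $a\ge 0$ and $W^{\star}$ orthonormal), and take one gradient-descent step on an over-parameterized two-layer ReLU student from random initialization. This is only an illustrative sketch of the data model and training dynamics, not the paper's algorithm or proof; the dimensions, width, and learning rate below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 8, 64, 1000  # input dim, hidden width (over-parameterized, m > d), samples

# Ground truth: f*(x) = a^T |W* x|, a nonnegative, W* orthonormal (via QR).
a = np.abs(rng.standard_normal(d))
W_star, _ = np.linalg.qr(rng.standard_normal((d, d)))

X = rng.standard_normal((n, d))   # Gaussian inputs
y = np.abs(X @ W_star.T) @ a      # labels a^T |W* x| (entrywise absolute value)

# Student: f(x) = sum_j b_j * relu(w_j . x), random initialization.
W = rng.standard_normal((m, d)) / np.sqrt(d)
b = rng.standard_normal(m) / np.sqrt(m)

def empirical_loss(W, b):
    pred = np.maximum(X @ W.T, 0.0) @ b
    return 0.5 * np.mean((pred - y) ** 2)

# One gradient-descent step on the empirical squared loss.
lr = 0.005
pre = X @ W.T                  # (n, m) pre-activations
h = np.maximum(pre, 0.0)       # ReLU features
err = h @ b - y                # residuals
grad_b = h.T @ err / n
grad_W = ((err[:, None] * b) * (pre > 0)).T @ X / n
loss_before = empirical_loss(W, b)
b -= lr * grad_b
W -= lr * grad_W
loss_after = empirical_loss(W, b)
print(loss_after < loss_before)  # one small GD step decreases the loss
```

Note that $|W^{\star}x| = \mathrm{ReLU}(W^{\star}x) + \mathrm{ReLU}(-W^{\star}x)$, so the target is itself representable by a two-layer ReLU network of width $2d$; the student is over-parameterized whenever $m > 2d$.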