We study sparse principal components analysis in high dimensions, where $p$ (the number of variables) can be much larger than $n$ (the number of observations), and analyze the problem of estimating the subspace spanned by the principal eigenvectors of the population covariance matrix. We introduce two complementary notions of $\ell_q$ subspace sparsity: row sparsity and column sparsity. We prove nonasymptotic lower and upper bounds on the minimax subspace estimation error for $0 \leq q \leq 1$. The bounds are optimal for row sparse subspaces and nearly optimal for column sparse subspaces; they apply to general classes of covariance matrices, and they show that $\ell_q$ constrained estimates can achieve optimal minimax rates without restrictive spiked covariance conditions. Interestingly, the form of the rates matches known results for sparse regression when the effective noise variance is defined appropriately. Our proof employs a novel variational $\sin\Theta$ theorem that may be useful in other regularized spectral estimation problems.
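The abstract does not state the two sparsity notions formally. As a sketch only, under the standard convention of representing the $d$-dimensional principal subspace by an orthonormal basis matrix $V \in \mathbb{R}^{p \times d}$ (the symbols $V$, $d$, and $R_q$ here are assumptions, not taken from the abstract), the two constraints can plausibly be written as:

```latex
\[
\text{row sparsity:}\quad \sum_{j=1}^{p} \lVert V_{j\cdot} \rVert_2^{\,q} \le R_q,
\qquad
\text{column sparsity:}\quad \max_{1 \le k \le d} \lVert V_{\cdot k} \rVert_q^{\,q} \le R_q,
\]
```

where $V_{j\cdot}$ is the $j$th row and $V_{\cdot k}$ the $k$th column of $V$; the $q = 0$ case is read as counting nonzero rows (respectively, nonzero entries per column). Row sparsity constrains the variables jointly across all principal directions, while column sparsity constrains each direction separately, which is why the two notions are complementary rather than nested.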