Journal: Computational and Mathematical Methods in Medicine (U.S. National Institutes of Health literature collection)

A Novel Hybrid Dimension Reduction Technique for Undersized High Dimensional Gene Expression Data Sets Using Information Complexity Criterion for Cancer Classification



Abstract

Gene expression data are typically large, complex, and highly noisy. They are high dimensional, with several thousand genes (i.e., features) but only a limited number of observations (i.e., samples). Although classical principal component analysis (PCA) is widely used as a standard first step in dimension reduction and in supervised and unsupervised classification, it suffers from several shortcomings on data sets with undersized samples, since the sample covariance matrix degenerates and becomes singular. In this paper we address these limitations within the context of probabilistic PCA (PPCA) by introducing and developing a novel approach using the maximum entropy covariance matrix and its hybridized smoothed covariance estimators. To reduce the dimensionality of the data and to choose the number of probabilistic PCs (PPCs) to be retained, we further introduce and develop the celebrated Akaike's information criterion (AIC), the consistent Akaike's information criterion (CAIC), and the information theoretic measure of complexity (ICOMP) criterion of Bozdogan. Six publicly available undersized benchmark data sets were analyzed to show the utility, flexibility, and versatility of our approach with hybridized smoothed covariance matrix estimators, which do not degenerate, for performing PPCA to reduce the dimension and for carrying out supervised classification of cancer groups in high dimensions.
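The ideas in the abstract can be illustrated with a minimal sketch. It is not the paper's method: the maximum entropy and hybridized estimators are not specified here, so a simple convex combination of the sample covariance and a scaled identity stands in for them (the function `shrinkage_covariance` and its `alpha` parameter are hypothetical). The PPCA log-likelihood follows the standard Tipping-Bishop form, evaluated at the smoothed covariance as a plug-in assumption. ICOMP is omitted because it requires the inverse Fisher information matrix, which the abstract does not give. The data are synthetic.

```python
import numpy as np

def shrinkage_covariance(X, alpha=0.1):
    """Convex combination of the sample covariance and a scaled identity.

    Assumption: a simple stand-in for a smoothed estimator, NOT the
    paper's maximum entropy or hybridized estimators.
    """
    n, p = X.shape
    S = np.cov(X, rowvar=False)               # p x p sample covariance
    target = (np.trace(S) / p) * np.eye(p)    # scaled identity target
    return (1.0 - alpha) * S + alpha * target

def ppca_criteria(eigvals, n, q):
    """AIC and CAIC for a PPCA model retaining q components.

    eigvals: covariance eigenvalues sorted in descending order.
    """
    p = eigvals.size
    sigma2 = eigvals[q:].mean()               # noise variance: mean of discarded eigenvalues
    loglik = -0.5 * n * (p * np.log(2 * np.pi)
                         + np.log(eigvals[:q]).sum()
                         + (p - q) * np.log(sigma2)
                         + p)
    # Free parameters: loadings up to rotation, noise variance, mean vector.
    k = p * q - q * (q - 1) / 2 + 1 + p
    aic = -2 * loglik + 2 * k
    caic = -2 * loglik + k * (np.log(n) + 1)
    return aic, caic

# Synthetic undersized data: n = 20 samples, p = 50 features.
rng = np.random.default_rng(0)
n, p = 20, 50
signal = (rng.normal(size=(n, 3)) @ rng.normal(size=(3, p))) * 3
X = signal + rng.normal(size=(n, p))

S = np.cov(X, rowvar=False)
rank_S = np.linalg.matrix_rank(S)             # at most n - 1, so S is singular here

S_smooth = shrinkage_covariance(X, alpha=0.1)
eigvals = np.sort(np.linalg.eigvalsh(S_smooth))[::-1]

scores = {q: ppca_criteria(eigvals, n, q) for q in range(1, 10)}
best_q_aic = min(scores, key=lambda q: scores[q][0])
best_q_caic = min(scores, key=lambda q: scores[q][1])
```

With fewer samples than features, the sample covariance has rank at most n - 1, so its smallest eigenvalues are zero and the log terms in the likelihood are undefined. The smoothed matrix is positive definite, which is what lets the criteria be evaluated for each candidate q.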
