Tumor classification based on gene expression profiles has drawn wide attention in recent years because it greatly facilitates accurate cancer diagnosis and subtype recognition. However, gene expression data typically have few samples, high dimensionality, heavy noise, and substantial redundancy, which makes it difficult both to mine the biomedical knowledge contained in the profiles thoroughly and accurately and to select informative genes for tumor classification. This paper therefore proposes an iterative Lasso-based approach, Gene Selection Based on Iterative Lasso (GSIL), to obtain an informative gene subset that contains few genes yet classifies well. The algorithm has two steps. In the first step, a gene-ranking criterion, the signal-to-noise ratio (SNR), measures the importance of each gene, and the top-ranked genes are kept as the candidate subset, which filters out irrelevant genes. In the second step, an improved Lasso method, Iterative Lasso, eliminates redundant genes from the candidates. Experiments on five public tumor gene expression datasets validate the feasibility and effectiveness of the proposed method and show that it achieves better classification performance than existing informative gene selection methods.
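The first, filtering step can be illustrated with a minimal sketch. The abstract does not state the exact signal-to-noise formula, so the code assumes the commonly used two-class form |μ₁ − μ₀| / (σ₁ + σ₀) computed per gene; the function names `snr_scores` and `top_snr_genes` and the cutoff `k` are hypothetical and not taken from the paper.

```python
import numpy as np


def snr_scores(X, y):
    """Per-gene signal-to-noise ratio for a two-class problem.

    X: (n_samples, n_genes) expression matrix.
    y: class labels in {0, 1}, one per sample.
    Assumed form: |mean_1 - mean_0| / (std_1 + std_0).
    """
    pos, neg = X[y == 1], X[y == 0]
    mean_gap = np.abs(pos.mean(axis=0) - neg.mean(axis=0))
    spread = pos.std(axis=0, ddof=1) + neg.std(axis=0, ddof=1)
    # The small constant guards against division by zero for constant genes.
    return mean_gap / (spread + 1e-12)


def top_snr_genes(X, y, k):
    """Indices of the k highest-scoring genes, best first (candidate subset)."""
    scores = snr_scores(X, y)
    return np.argsort(-scores)[:k]


if __name__ == "__main__":
    # Toy data: 4 samples, 3 genes. Gene 0 separates the classes cleanly,
    # gene 1 is constant, gene 2 overlaps heavily between classes.
    X = np.array([[1.0, 5.0, 1.0],
                  [2.0, 5.0, 3.0],
                  [10.0, 5.0, 2.0],
                  [11.0, 5.0, 4.0]])
    y = np.array([0, 0, 1, 1])
    print(snr_scores(X, y))
    print(top_snr_genes(X, y, 2))
```

In a pipeline of the kind described, the indices returned by `top_snr_genes` would form the candidate subset passed on to the Lasso-based redundancy-removal step; how many genes to keep is a tuning choice that the abstract does not specify.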