Classification with correlated features: unreliability of feature ranking and solutions


Abstract

Motivation: Classification and feature selection of genomics or transcriptomics data is often hampered by the large number of features as compared with the small number of samples available. Moreover, features represented by probes that either have similar molecular functions (gene expression analysis) or genomic locations (DNA copy number analysis) are highly correlated. Classical model selection methods such as penalized logistic regression or random forest become unstable in the presence of high feature correlations. Sophisticated penalties such as group Lasso or fused Lasso can force the models to assign similar weights to correlated features and thus improve model stability and interpretability. In this article, we show that the measures of feature relevance corresponding to the above-mentioned methods are biased such that the weights of the features belonging to groups of correlated features decrease as the sizes of the groups increase, which leads to incorrect model interpretation and misleading feature ranking.
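The bias described in the abstract can be illustrated with a small synthetic experiment. The sketch below is not the authors' method or data: it assumes a plain ridge (L2-penalized) linear regression solved in closed form, and the names `ridge_weights` and `group_weight_demo`, the noise levels, and the penalty strength are all hypothetical choices made for illustration. It builds a group of noisy copies of one relevant signal plus a single independent feature of equal relevance, then compares the average weight inside the group with the weight of the lone feature.

```python
import numpy as np


def ridge_weights(X, y, lam=10.0):
    """Closed-form ridge regression weights (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


def group_weight_demo(group_size, n_samples=200, copy_noise=0.1, seed=0):
    """Return (mean |weight| within the correlated group, |weight| of the lone feature).

    Assumption: both the latent signal and the lone feature contribute
    equally to the response, so an unbiased relevance measure would rank
    them the same regardless of group size.
    """
    rng = np.random.default_rng(seed)
    signal = rng.standard_normal(n_samples)  # latent relevant signal
    lone = rng.standard_normal(n_samples)    # independent, equally relevant feature
    y = signal + lone + 0.1 * rng.standard_normal(n_samples)

    # Correlated group: `group_size` noisy copies of the same latent signal.
    group = signal[:, None] + copy_noise * rng.standard_normal((n_samples, group_size))
    X = np.column_stack([group, lone])

    w = ridge_weights(X, y)
    return np.abs(w[:group_size]).mean(), abs(w[-1])


if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        group_w, lone_w = group_weight_demo(k)
        print(f"group size {k:2d}: mean group weight {group_w:.3f}, lone weight {lone_w:.3f}")
```

In this setup the summed weight of the group stays roughly constant, so each member's share shrinks as the group grows, while the lone feature keeps its weight. Ranking features by individual weight would therefore place the members of a large correlated group below an equally relevant uncorrelated feature.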
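Bibliographic details

  • Source
    Bioinformatics | 2011, Issue 14 | pp. 1986-1994 | 9 pages
  • Authors

    Thomas Lengauer;

  • Author affiliation
  • Indexed in: Science Citation Index (SCI, USA); Chemical Abstracts (CA, USA)
  • Original format: PDF
  • Language: English
  • Chinese Library Classification
  • Keywords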
