Conference: Third IFIP TC 5, TC 12, WG 8.4, WG 8.9, WG 12.9 International Cross-Domain Conference for Machine Learning and Knowledge Extraction
Ranked MSD: A New Feature Ranking and Feature Selection Approach for Biomarker Identification

Abstract

In the era of big data when a huge amount of data is continuously being generated, it is common for situations to arise where the number of samples is much smaller than the number of features (variables) per sample. This phenomenon is often found in biomedical domains, where we may have relatively few patients, compared to the amount of data per patient. For example, gene expression data typically has between 10,000 and 60,000 features per sample. A separate issue arises from the "right to explanation" found in the European General Data Protection Regulation (GDPR), which may prevent the use of black-box models in applications where explainability is required. In such situations, there is a need for robust algorithms which can identify the relevant features from experimental data by discarding irrelevant ones, yielding a simpler subset that facilitates explanation. To address these needs, we have developed a new algorithm for feature ranking and feature selection, named Ranked MSD. We have tested our proposed approach on two real-world gene expression data sets, both of which relate to respiratory viral infections. This Ranked MSD feature selection algorithm is able to reduce the feature set size from 12,023 genes (features) to 65 genes on the first data set and from 20,737 genes to 31 genes on the second data set, in both cases without any significant loss in disease prediction accuracy. In an alternative configuration, our proposed algorithm is able to identify a small subset of features that gives better accuracy than that of the full feature set. Our proposed algorithm can also identify important biomarkers (genes) with their importance score for a particular disease and the identified top-ranked biomarkers can play a vital role in drug discovery and precision medicine.
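The abstract does not describe how Ranked MSD scores features, so the following is only a minimal sketch of the general "rank, then keep the top k" workflow it refers to, applied to data with far more features than samples. The scoring rule (a standardized difference of class means) is a generic stand-in and is not the Ranked MSD criterion; the function names `rank_features` and `select_top_k`, the synthetic data, and the planted feature indices are all assumptions made for illustration.

```python
import numpy as np


def rank_features(X, y):
    """Rank features by a simple univariate score (a stand-in, NOT Ranked MSD).

    Score = |mean(class 1) - mean(class 0)| / (std(class 1) + std(class 0)).
    Returns feature indices sorted from highest to lowest score, plus the scores.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    pos, neg = X[y == 1], X[y == 0]
    diff = np.abs(pos.mean(axis=0) - neg.mean(axis=0))
    spread = pos.std(axis=0) + neg.std(axis=0) + 1e-12  # avoid division by zero
    scores = diff / spread
    order = np.argsort(-scores)  # descending by score
    return order, scores


def select_top_k(X, order, k):
    """Keep only the k highest-ranked feature columns."""
    return np.asarray(X)[:, order[:k]]


# Synthetic stand-in for expression data: few samples, many features.
rng = np.random.default_rng(0)
n_samples, n_features = 40, 2000
X = rng.normal(size=(n_samples, n_features))
y = np.array([0, 1] * (n_samples // 2))

# Plant three informative features (hypothetical indices) by shifting class 1.
informative = [5, 17, 42]
X[:, informative] += 6.0 * y[:, None]

order, scores = rank_features(X, y)
X_small = select_top_k(X, order, k=3)
print("top-ranked features:", sorted(order[:3].tolist()))
print("reduced shape:", X_small.shape)
```

In practice the reduced matrix would be passed to a classifier and its cross-validated accuracy compared with that of the full feature set, which is the kind of comparison the abstract reports.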