Procedia Computer Science

A Comparative Study of Feature Selection and Classification Methods for Gene Expression Data of Glioma

Abstract

Microarray gene expression data have gained great importance in recent years because of their role in disease diagnosis and prognosis, which helps clinicians choose an appropriate treatment plan for patients. This technology has ushered in a new era in molecular classification. Interpreting gene expression data remains a difficult problem and an active research area because such data are inherently “high dimensional, low sample size”. This property poses great challenges to existing classification methods. Effective feature selection techniques are therefore often needed to help classify different tumor types correctly, which in turn leads to a better understanding of genetic signatures and to improved treatment strategies. This paper presents a comparative study of state-of-the-art feature selection methods, classification methods, and their combinations, based on gene expression data. We compared the efficiency of three classification methods (support vector machines, k-nearest neighbor, and random forest) and eight feature selection methods (information gain, twoing rule, sum minority, max minority, Gini index, sum of variances, t-statistics, and one-dimensional support vector machine). Five-fold cross-validation was used to evaluate classification performance. Two publicly available glioma gene expression data sets were used in the experiments. The results reveal the important role of feature selection in classifying gene expression data: with feature selection, classification accuracy can be boosted significantly using only a small number of genes. The relationship among the features selected by the different feature selection methods is investigated, and the features most frequently selected in each fold across all methods are evaluated for both data sets.
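The evaluation protocol described in the abstract (rank genes, keep a small subset, classify, score with five-fold cross-validation) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a two-class problem, uses a Welch-style t-statistic as the ranking criterion, a 3-nearest-neighbor classifier, and synthetic data in place of the glioma data sets. It also assumes that genes are re-selected inside each training fold, which keeps the held-out samples out of the selection step. All function names (`t_scores`, `top_k`, `knn_predict`, `cross_validate`, `make_synthetic`) and parameter values are hypothetical.

```python
import math
import random
from statistics import mean, variance


def t_scores(X, y):
    """Absolute Welch t-statistic of every feature for labels 0 and 1."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
        scores.append(abs(mean(a) - mean(b)) / se if se > 0 else 0.0)
    return scores


def top_k(scores, k):
    """Indices of the k highest-scoring features, best first."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k nearest training samples (Euclidean)."""
    nearest = sorted(range(len(X_train)),
                     key=lambda i: math.dist(X_train[i], x))[:k]
    votes = sum(y_train[i] for i in nearest)
    return 1 if 2 * votes > k else 0


def cross_validate(X, y, n_genes, n_folds=5, seed=0):
    """Accuracy over n_folds folds, selecting genes on each training fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        # Feature selection sees only the training part of the fold.
        genes = top_k(t_scores(X_tr, y_tr), n_genes)
        X_tr_sel = [[row[j] for j in genes] for row in X_tr]
        for i in test_idx:
            x = [X[i][j] for j in genes]
            correct += int(knn_predict(X_tr_sel, y_tr, x) == y[i])
    return correct / len(X)


def make_synthetic(n_samples=40, n_genes=200, n_informative=5,
                   shift=4.0, seed=0):
    """Toy 'high dimensional, low sample size' data (assumed, not real)."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n_samples)]
    X = [[rng.gauss(0.0, 1.0)
          + (shift if label == 1 and j < n_informative else 0.0)
          for j in range(n_genes)]
         for label in y]
    return X, y


if __name__ == "__main__":
    X, y = make_synthetic()
    print("5 selected genes:", cross_validate(X, y, n_genes=5))
    print("all 200 genes:   ", cross_validate(X, y, n_genes=200))
```

Swapping `t_scores` for another ranking criterion, or `knn_predict` for another classifier, gives the kind of method-by-method grid the study compares. The numbers printed for the synthetic data say nothing about the results reported in the paper.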
