Conference: IEEE International Conference on Bioinformatics and Bioengineering

Evaluation of Wrapper-Based Feature Selection Using Hard, Moderate, and Easy Bioinformatics Data

Abstract

One of the most challenging problems encountered when analyzing real-world gene expression datasets is high dimensionality (an overabundance of features/attributes). This large number of features can lead to suboptimal classification performance and increased computation time. Feature selection, whereby only a subset of the original features is used for building a classification model, is the most commonly used technique to counter high dimensionality. One category of feature selection, wrapper-based techniques, employs a classifier to directly find the subset of features that performs best. Unfortunately, noise can negatively impact the effectiveness of data mining techniques and subsequently lead to suboptimal results. Class noise in particular has a detrimental effect on classification performance, making datasets perform poorly across a wide range of classifiers (i.e., having a high "difficulty-of-learning"). No previous work has examined the effectiveness of wrapper-based feature selection when learning from real-world, high-dimensional gene expression datasets in the context of difficulty-of-learning due to noise. To study this effectiveness, we perform experiments using ten gene expression datasets that were first determined to be easy to learn from and then had artificial class noise injected in a controlled fashion, creating three levels of difficulty-of-learning (Easy, Moderate, and Hard). Using the Naïve Bayes learner, we perform wrapper feature selection followed by classification using four classifiers (Naïve Bayes, Multilayer Perceptron, 5-Nearest Neighbor, and Support Vector Machines), and we compare these results to the classification performance without feature selection.
The results show that the effectiveness of wrapper-based feature selection depends on the choice of learner: for Multilayer Perceptron, wrapper selection improved performance compared to not using feature selection, while for Naïve Bayes it slightly reduced performance, and for the remaining learners it reduced performance further. Because its performance relative to no feature selection varied with the choice of learner, we recommend that wrapper selection be at least considered in future bioinformatics experiments, especially if the goal is gene discovery rather than classification. Also, as dimensionality reduction techniques are not only useful but necessary for high-dimensional bioinformatics datasets, the no-feature-selection case may not be feasible in practice.
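
The procedure described in the abstract (inject class noise, then let a Naïve Bayes learner choose the feature subset) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not state the search strategy, the noise-injection scheme, or the evaluation metric, so everything below is an assumption made for illustration: greedy forward search, random label flipping on a binary problem, a Gaussian Naïve Bayes model, 5-fold cross-validated accuracy, and a small synthetic dataset in which only feature 0 is informative. All function names are hypothetical.

```python
import math
import random


def inject_class_noise(labels, rate, rng):
    """Flip the binary label of a randomly chosen fraction `rate` of instances.
    Assumption: the abstract does not describe its injection scheme."""
    noisy = list(labels)
    n_flip = int(round(rate * len(noisy)))
    for i in rng.sample(range(len(noisy)), n_flip):
        noisy[i] = 1 - noisy[i]
    return noisy


def fit_gnb(X, y, feats):
    """Fit a Gaussian Naive Bayes model restricted to the features in `feats`."""
    model = {}
    for c in set(y):
        rows = [x for x, lab in zip(X, y) if lab == c]
        stats = []
        for f in feats:
            vals = [r[f] for r in rows]
            mu = sum(vals) / len(vals)
            # Small constant keeps the variance strictly positive.
            var = sum((v - mu) ** 2 for v in vals) / len(vals) + 1e-6
            stats.append((mu, var))
        model[c] = (math.log(len(rows) / len(y)), stats)
    return model


def predict_gnb(model, x, feats):
    """Return the class with the highest log-posterior for instance `x`."""
    def score(c):
        log_prior, stats = model[c]
        return log_prior - 0.5 * sum(
            math.log(2 * math.pi * var) + (x[f] - mu) ** 2 / var
            for f, (mu, var) in zip(feats, stats)
        )
    return max(model, key=score)


def cv_accuracy(X, y, feats, k=5):
    """k-fold cross-validated accuracy; instance i belongs to fold i % k."""
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        model = fit_gnb([X[i] for i in train], [y[i] for i in train], feats)
        correct += sum(predict_gnb(model, X[i], feats) == y[i] for i in test)
    return correct / len(X)


def wrapper_forward_select(X, y, max_feats=3):
    """Greedy forward wrapper: repeatedly add the feature that most improves
    the learner's cross-validated accuracy; stop when nothing improves it."""
    selected, best = [], 0.0
    candidates = set(range(len(X[0])))
    while candidates and len(selected) < max_feats:
        scored = [(cv_accuracy(X, y, selected + [f]), f) for f in sorted(candidates)]
        acc, f = max(scored, key=lambda t: (t[0], -t[1]))  # ties: lowest index
        if acc <= best:
            break
        selected.append(f)
        candidates.remove(f)
        best = acc
    return selected, best


# Synthetic stand-in for an "easy" dataset: 60 instances, 8 features,
# only feature 0 separates the two classes.
rng = random.Random(0)
y = [i % 2 for i in range(60)]
X = [[rng.gauss(4.0 * lab if j == 0 else 0.0, 1.0) for j in range(8)] for lab in y]

noisy_y = inject_class_noise(y, 0.10, random.Random(1))  # 10% class noise
selected, acc = wrapper_forward_select(X, noisy_y)
print("selected features:", selected, "cv accuracy:", round(acc, 3))
```

Raising the `rate` argument is one way to mimic harder difficulty-of-learning levels. The selected subset could then be passed to other classifiers to compare against training on all features, as the abstract describes.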
