Conference: IEEE International Conference on Bioinformatics and Bioengineering

Evaluation of Wrapper-Based Feature Selection Using Hard, Moderate, and Easy Bioinformatics Data

Abstract

One of the most challenging problems encountered when analyzing real-world gene expression datasets is high dimensionality (an overabundance of features/attributes). This large number of features can lead to suboptimal classification performance and increased computation time. Feature selection, whereby only a subset of the original features is used for building a classification model, is the most commonly used technique to counter high dimensionality. One category of feature selection, called wrapper-based techniques, employs a classifier to directly find the subset of features which performs best. Unfortunately, noise can negatively impact the effectiveness of data mining techniques and subsequently lead to suboptimal results. Class noise in particular has a detrimental effect on classification performance, making datasets perform poorly across a wide range of classifiers (i.e., having a high "difficulty-of-learning"). No previous work has examined the effectiveness of wrapper-based feature selection when learning from real-world, high-dimensional gene expression datasets in the context of difficulty-of-learning due to noise. To study this effectiveness, we perform experiments using ten gene expression datasets which were first determined to be easy to learn from and then had artificial class noise injected in a controlled fashion, creating three levels of difficulty-of-learning (Easy, Moderate, and Hard). We perform wrapper feature selection with the Naïve Bayes learner, followed by classification with four classifiers (Naïve Bayes, Multilayer Perceptron, 5-Nearest Neighbor, and Support Vector Machines), and we compare these results to the classification performance without feature selection.

The results show that the effectiveness of wrapper-based feature selection depends on the choice of learner: for Multilayer Perceptron, wrapper selection improved performance compared to not using feature selection, while for Naïve Bayes it slightly reduced performance, and for the remaining learners it reduced performance further. Because its performance relative to no feature selection varied depending on the choice of learner, we recommend that wrapper selection be at least considered in future bioinformatics experiments, especially if the goal is gene discovery rather than classification. Also, as dimensionality reduction techniques are not only useful but necessary for high-dimensional bioinformatics datasets, the no-feature-selection case may not be feasible in practice.
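
The two building blocks named in the abstract, controlled class-noise injection and wrapper selection driven by a Naïve Bayes learner, can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, not the authors' procedure: the abstract does not say how noise was injected (the sketch flips binary labels at random), which search strategy the wrapper used (the sketch uses greedy forward search), or which Naïve Bayes variant was applied (the sketch uses a Gaussian one). All function names and the toy data are hypothetical.

```python
import math
import random


def inject_class_noise(labels, rate, rng):
    """Return a copy of binary (0/1) labels with a fraction `rate` flipped (assumed scheme)."""
    noisy = list(labels)
    n_flip = int(round(rate * len(noisy)))
    for i in rng.sample(range(len(noisy)), n_flip):
        noisy[i] = 1 - noisy[i]
    return noisy


def nb_fit(X, y, features):
    """Fit a Gaussian Naive Bayes model using only the given feature indices."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        stats = []
        for f in features:
            vals = [r[f] for r in rows]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals) + 1e-6
            stats.append((mean, var))
        model[c] = (math.log(len(rows) / len(y)), stats)
    return model


def nb_predict(model, x, features):
    """Return the class with the highest log-posterior for instance x."""
    best, best_score = None, -math.inf
    for c, (log_prior, stats) in model.items():
        score = log_prior
        for f, (mean, var) in zip(features, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (x[f] - mean) ** 2 / (2 * var)
        if score > best_score:
            best, best_score = c, score
    return best


def cv_accuracy(X, y, features, folds=5):
    """k-fold cross-validated accuracy of Naive Bayes restricted to `features`."""
    n, correct = len(X), 0
    for k in range(folds):
        test_idx = set(range(k, n, folds))
        X_tr = [X[i] for i in range(n) if i not in test_idx]
        y_tr = [y[i] for i in range(n) if i not in test_idx]
        model = nb_fit(X_tr, y_tr, features)
        correct += sum(nb_predict(model, X[i], features) == y[i] for i in test_idx)
    return correct / n


def wrapper_forward_select(X, y, max_features):
    """Greedy forward wrapper: add the feature that most improves CV accuracy."""
    selected, best_acc = [], 0.0
    candidates = list(range(len(X[0])))
    while candidates and len(selected) < max_features:
        round_best, round_acc = None, -1.0
        for f in candidates:
            acc = cv_accuracy(X, y, selected + [f])
            if acc > round_acc:
                round_best, round_acc = f, acc
        if round_acc <= best_acc:
            break  # no candidate improves the learner, so stop searching
        selected.append(round_best)
        candidates.remove(round_best)
        best_acc = round_acc
    return selected, best_acc


def make_toy_data(rng, n=60, n_features=8):
    """Hypothetical toy data: only feature 0 carries class signal."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        row[0] += 6.0 * label
        X.append(row)
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data(random.Random(1))
    for rate in (0.0, 0.1, 0.2):  # illustrative noise levels, not the paper's
        noisy = inject_class_noise(y, rate, random.Random(0))
        print(rate, wrapper_forward_select(X, noisy, max_features=3))
```

Because the wrapper scores each candidate subset by training the learner itself, label noise feeds directly into the search: every flipped label changes the cross-validated accuracy that decides which features are kept. Evaluating the selected subset with other classifiers, as the abstract describes, would mean training each of them on the chosen columns only.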
