Conference: International Conference on Machine Learning and Applications

Comparative Analysis on the Stability of Feature Selection Techniques using Three Frameworks on Biological Datasets


Abstract

Feature (gene) selection is a common preprocessing technique for countering the high dimensionality (too many independent features) found in many bioinformatics datasets: it creates a smaller feature subset that includes only the most important features. Although feature selection techniques are often evaluated by how much they improve classification performance, it is also important to find stable feature selection techniques, which give consistent results even in the face of dataset perturbations (such as class noise, or sampling used to alleviate the problem of imbalanced data). This is especially important in bioinformatics, where the prime concern may be gene discovery rather than classification. In this study we use three frameworks to evaluate the stability of gene selection techniques: "sampled-clean vs. sampled-clean," "sampled-noisy vs. sampled-noisy," and "sampled-clean vs. sampled-noisy." All frameworks involve pair-wise comparisons among the results from the perturbed datasets (perturbed either by sampling or by class noise injection followed by sampling). They differ in what they observe: how sampling can create variation within the feature subsets (sampled-clean vs. sampled-clean), how noisy datasets (which were then sampled) can create a wide spread of selected features (sampled-noisy vs. sampled-noisy), or how features selected on clean and noisy datasets differ after both datasets have been sampled (sampled-clean vs. sampled-noisy). Along with these three frameworks, our comparison of seven feature ranking techniques uses four cancer gene datasets, applies three sampling techniques, and generates artificial class noise to better simulate real-world datasets.
The results from the frameworks are generally similar, with Signal-To-Noise and ReliefF showing the best stability and Gain Ratio showing the worst across all three frameworks. Relief-W is notable for showing moderate to above-average stability when the clean datasets are used but giving the second-worst performance when noise is present.
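
The pair-wise comparisons behind the three frameworks can be illustrated with a minimal Python sketch. The abstract does not say which similarity measure the authors use, so the Jaccard index here is an assumption, and the function names (`jaccard`, `mean_within`, `mean_between`, `stability_report`) and gene labels are hypothetical. The sketch only shows how the three frameworks pair up feature subsets: within the clean group, within the noisy group, and across the two groups.

```python
from itertools import combinations, product


def jaccard(a, b):
    """Jaccard similarity of two feature subsets (assumed measure, not from the paper)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty subsets are treated as identical
    return len(a & b) / len(a | b)


def mean_within(subsets):
    """Average similarity over all unordered pairs taken from one group of subsets."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two feature subsets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def mean_between(group_a, group_b):
    """Average similarity over all pairs with one subset from each group."""
    pairs = list(product(group_a, group_b))
    if not pairs:
        raise ValueError("both groups must be non-empty")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def stability_report(clean, noisy):
    """Score the three frameworks named in the abstract.

    clean: feature subsets selected on sampled versions of the clean dataset
    noisy: feature subsets selected on sampled versions of the noise-injected dataset
    """
    return {
        "sampled-clean vs. sampled-clean": mean_within(clean),
        "sampled-noisy vs. sampled-noisy": mean_within(noisy),
        "sampled-clean vs. sampled-noisy": mean_between(clean, noisy),
    }


# Toy, made-up gene subsets purely for illustration.
clean_subsets = [{"g1", "g2", "g3"}, {"g1", "g2", "g4"}]
noisy_subsets = [{"g1", "g5", "g6"}, {"g2", "g5", "g7"}]
report = stability_report(clean_subsets, noisy_subsets)
```

A higher score means the ranker picked more of the same genes across perturbed datasets. In a real study each subset would be the top-ranked features returned by one ranking technique on one perturbed dataset.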
