Conference: IEEE International Conference on Machine Learning and Applications

Does the Inclusion of Data Sampling Improve the Performance of Boosting Algorithms on Imbalanced Bioinformatics Data?


Abstract

Bioinformatics datasets have many challenging characteristics, such as class imbalance, which adversely affects the performance of supervised classification models built on these datasets. Techniques from data mining, such as ensemble learning and data sampling, can be deployed to alleviate the problem and improve classification performance. In this study, we sought to determine whether including data sampling within the ensemble framework can further improve the performance of classification models. To this end, we performed an experimental study using two new hybrid ensemble techniques: one integrates feature selection within the boosting process, and the other incorporates random under-sampling followed by feature selection within the boosting framework. The experiments used two learners, three forms of feature rankers, and four feature subset sizes on 15 highly imbalanced bioinformatics datasets. Our results and statistical analysis demonstrate that the difference between the two boosting methods is not statistically significant. Therefore, because the inclusion of data sampling has no significant positive effect on the performance of ensemble classifiers, it is not required to achieve maximum classification performance. To our knowledge, this is the first empirical study to examine the effect of data sampling (specifically, random under-sampling) on the classification performance of boosting algorithms for highly imbalanced bioinformatics data.
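The two boosting variants compared in the abstract can be illustrated with a minimal Python sketch. The abstract does not name the learners, the feature rankers, the sampling ratio, or the exact boosting procedure, so everything below is an assumption made for illustration: an AdaBoost-style loop, a decision-stump learner, a weighted mean-difference ranker, a 1:1 under-sampling ratio, and all function and parameter names (`boost`, `use_rus`, `k`, and so on). It is not the authors' implementation.

```python
import math
import random

def random_undersample(y, rng, minority=1):
    """Keep every minority row plus an equal-size random draw of majority rows (assumed 1:1 ratio)."""
    pos = [i for i, label in enumerate(y) if label == minority]
    neg = [i for i, label in enumerate(y) if label != minority]
    return pos + rng.sample(neg, min(len(pos), len(neg)))

def rank_features(X, y, idx, w, k):
    """Hypothetical ranker: weighted absolute difference of class means; returns the top-k features.
    Assumes both classes (+1 and -1) occur among the rows in idx."""
    scores = []
    for j in range(len(X[0])):
        sums, tot = {1: 0.0, -1: 0.0}, {1: 0.0, -1: 0.0}
        for i in idx:
            sums[y[i]] += w[i] * X[i][j]
            tot[y[i]] += w[i]
        scores.append(abs(sums[1] / tot[1] - sums[-1] / tot[-1]))
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]

def stump_predict(stump, row):
    j, t, sign = stump
    return sign if row[j] > t else -sign

def fit_stump(X, y, idx, w, feats):
    """Weighted decision stump restricted to the selected features (assumed weak learner)."""
    best = None
    for j in feats:
        vals = sorted({X[i][j] for i in idx})
        # Candidate thresholds are midpoints between consecutive observed values.
        thresholds = [(a + b) / 2 for a, b in zip(vals, vals[1:])] or vals
        for t in thresholds:
            for sign in (1, -1):
                err = sum(w[i] for i in idx if stump_predict((j, t, sign), X[i]) != y[i])
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best[1:]

def boost(X, y, rounds=5, k=2, use_rus=True, seed=0):
    """AdaBoost-style loop with per-round feature selection; use_rus toggles under-sampling."""
    rng = random.Random(seed)
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        idx = random_undersample(y, rng) if use_rus else list(range(n))
        feats = rank_features(X, y, idx, w, k)   # feature selection inside the round
        stump = fit_stump(X, y, idx, w, feats)   # train on the (possibly sampled) rows
        # The weighted error is measured on the full training set, not the sample.
        err = sum(w[i] for i in range(n) if stump_predict(stump, X[i]) != y[i])
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        w = [w[i] * math.exp(-alpha * y[i] * stump_predict(stump, X[i])) for i in range(n)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score > 0 else -1

# Toy imbalanced data (10% minority): feature 0 is informative, features 1 and 2 are noise.
y = [1 if i < 4 else -1 for i in range(40)]
X = [[1.0 + 0.1 * i if i < 4 else -1.0 - 0.01 * i, (i * 7 % 11) / 11.0, (i * 3 % 5) / 5.0]
     for i in range(40)]

def accuracy(ensemble):
    return sum(predict(ensemble, row) == label for row, label in zip(X, y)) / len(y)

acc_with_rus = accuracy(boost(X, y, use_rus=True))
acc_without_rus = accuracy(boost(X, y, use_rus=False))
```

Calling `boost` with `use_rus=True` and `use_rus=False` on the same data gives the two variants side by side. The toy data is trivially separable, so it only demonstrates the mechanics and says nothing about the comparison reported in the study.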
