Does the Inclusion of Data Sampling Improve the Performance of Boosting Algorithms on Imbalanced Bioinformatics Data?

Abstract

Bioinformatics datasets pose many challenges, including class imbalance, which degrades the performance of supervised classification models built on them. Data mining techniques such as ensemble learning and data sampling can alleviate the problem and improve classification performance. In this study, we investigate whether including data sampling within the ensemble framework further improves the performance of classification models. To this end, we performed an experimental study using two new hybrid ensemble techniques: one integrates feature selection within the boosting process, and the other incorporates random under-sampling followed by feature selection within the boosting framework. The study used two learners, three forms of feature rankers, and four feature subset sizes on 15 highly imbalanced bioinformatics datasets. Our results and statistical analysis show that the difference between the two boosting methods is not statistically significant. Therefore, because the inclusion of data sampling has no significant positive effect on the performance of the ensemble classifiers, it is not required to achieve maximum classification performance. To our knowledge, this is the first empirical study to examine the effect of data sampling, specifically random under-sampling, on the classification performance of boosting algorithms for highly imbalanced bioinformatics data.
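
The random under-sampling step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `random_undersample` and its `ratio` and `seed` parameters are assumptions introduced here for illustration, and the sketch covers only the sampling step, not the feature selection or boosting rounds that would follow it.

```python
import random
from collections import defaultdict


def random_undersample(labels, ratio=1.0, seed=0):
    """Return sorted indices of a class-balanced subset of `labels`.

    Every instance of the smallest class is kept; each larger class is
    randomly reduced to at most `ratio` times the minority-class count.
    `ratio` and `seed` are illustrative assumptions, not from the paper.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    minority_size = min(len(idxs) for idxs in by_class.values())
    cap = max(1, int(round(ratio * minority_size)))

    kept = []
    for idxs in by_class.values():
        if len(idxs) <= cap:
            kept.extend(idxs)                   # small class: keep all
        else:
            kept.extend(rng.sample(idxs, cap))  # large class: random subset
    return sorted(kept)


# Toy imbalanced label vector: 3 positives, 17 negatives.
labels = [1] * 3 + [0] * 17
subset = random_undersample(labels)
balanced = [labels[i] for i in subset]
print(len(balanced), sum(balanced))  # prints "6 3"
```

In a boosting setup like the one the abstract describes, a step of this kind would presumably run on the training data in each round, before feature ranking and before the base learner is fit, so that each round sees a more balanced class distribution.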

Bibliographic details

Source: IEEE International Conference on Machine Learning and Applications | 2015 | 8 pages
Authors: Alireza Fazelpour; Taghi M. Khoshgoftaar; David J. Dittman; Amri Napolitano
Original format: PDF
Chinese Library Classification: Computer software
Keywords: Boosting; bioinformatics; class imbalance; data sampling; ensemble learning

Similar literature

1. A Hybrid of Random Over Sample Examples and Boosted C5.0 Algorithms for Breast Cancer Diagnosis on Imbalanced Data [J]. Tian Jianxue, Zhang Jue, Tang Xiaofen. Journal of Medical Imaging and Health Informatics, 2020, No. 11.
2. Geometric mean based boosting algorithm with over-sampling to resolve data imbalance problem for bankruptcy prediction [J]. Myoung-Jong Kim, Dae-Ki Kang, Hong Bae Kim. Expert Systems with Applications, 2015, No. 3.
3. K-Neighbor over-sampling with cleaning data: a new approach to improve classification performance in data sets with class imbalance [J]. Budi Santoso, Hari Wijayanto, Khairil Anwar Notodiputro. Applied Mathematical Sciences, 2018, No. 9a12.
4. Does the Inclusion of Data Sampling Improve the Performance of Boosting Algorithms on Imbalanced Bioinformatics Data? [C]. Alireza Fazelpour, Taghi M. Khoshgoftaar, David J. Dittman. IEEE International Conference on Machine Learning and Applications, 2015.
5. Alleviating class imbalance using data sampling: Examining the effects on classification algorithms [D]. Napolitano, Amri E. 2006.
6. Improved PSO_AdaBoost Ensemble Algorithm for Imbalanced Data [O]. Kewen Li, Guangyue Zhou, Jiannan Zhai. 2019.
