IEEE International Conference on Bioinformatics and Biomedicine

Ensemble-based semi-supervised learning approaches for imbalanced splice site datasets


Abstract

Producing accurate classifiers depends on the quality and quantity of labeled data. Because labeled data is expensive to generate, its scarcity critically limits the application of machine learning algorithms to biological problems. Unlabeled data, however, can be acquired faster and in larger quantities thanks to Next Generation Sequencing technologies. When unlabeled instances far outnumber labeled ones, semi-supervised learning is a cost-effective alternative that can improve supervised classifiers by exploiting the unlabeled data. In practice, data often exhibits imbalanced class distributions, which are an obstacle for both supervised and semi-supervised learning. Supervised learning from imbalanced datasets has been studied extensively, and various solutions have been proposed to produce classifiers with optimal performance on highly skewed class distributions. Far fewer efforts have addressed the imbalanced data problem in semi-supervised learning. In this paper, we study several ensemble-based semi-supervised learning approaches for predicting splice sites, a problem for which the imbalance ratio is very high. We run experiments on five imbalanced datasets with the goal of identifying which variants are the most effective.
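
The abstract does not say which ensemble variants are studied, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a nearest-centroid base learner, an ensemble in which each member is trained on all minority instances plus an equally sized random sample of the majority class, and a simple self-training loop that pseudo-labels the most confident unlabeled points. All function names, parameters, and the synthetic 2-D data are hypothetical stand-ins; real splice site data would use sequence-derived features.

```python
import random

def centroid(rows):
    """Mean vector of a non-empty list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def train_ensemble(X, y, n_models, rng):
    """Train one nearest-centroid model per balanced subsample.

    Assumes both classes are present. Each subsample keeps every minority
    (label 1) instance and draws an equally sized random subset of the
    majority (label 0) instances.
    """
    pos = [x for x, lab in zip(X, y) if lab == 1]
    neg = [x for x, lab in zip(X, y) if lab == 0]
    models = []
    for _ in range(n_models):
        neg_sample = rng.sample(neg, min(len(pos), len(neg)))
        models.append((centroid(neg_sample), centroid(pos)))
    return models

def predict_proba(models, x):
    """Ensemble average of a distance-based score for class 1 (in [0, 1])."""
    scores = []
    for c0, c1 in models:
        d0, d1 = dist(x, c0), dist(x, c1)
        scores.append(0.5 if d0 + d1 == 0 else d0 / (d0 + d1))
    return sum(scores) / len(scores)

def self_train(X, y, unlabeled, n_models=5, rounds=3, per_round=4, seed=0):
    """Self-training: repeatedly pseudo-label the most confident unlabeled points."""
    rng = random.Random(seed)
    X, y, pool = list(X), list(y), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        models = train_ensemble(X, y, n_models, rng)
        # Rank unlabeled points by confidence: distance of the score from 0.5.
        pool.sort(key=lambda u: abs(predict_proba(models, u) - 0.5), reverse=True)
        confident, pool = pool[:per_round], pool[per_round:]
        for u in confident:
            X.append(u)
            y.append(1 if predict_proba(models, u) >= 0.5 else 0)
    # Final ensemble trained on labeled plus pseudo-labeled data.
    return train_ensemble(X, y, n_models, rng)

# Synthetic stand-in data (10:1 imbalance): majority near (0, 0), minority near (3, 3).
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 0.3), data_rng.gauss(0, 0.3)] for _ in range(40)] + \
    [[data_rng.gauss(3, 0.3), data_rng.gauss(3, 0.3)] for _ in range(4)]
y = [0] * 40 + [1] * 4
unlabeled = [[data_rng.gauss(0, 0.3), data_rng.gauss(0, 0.3)] for _ in range(20)] + \
            [[data_rng.gauss(3, 0.3), data_rng.gauss(3, 0.3)] for _ in range(5)]
models = self_train(X, y, unlabeled)
print(predict_proba(models, [3.0, 3.0]), predict_proba(models, [0.0, 0.0]))
```

Undersampling the majority class separately for each ensemble member is one common way to keep every base learner balanced while still using most of the majority data across the ensemble. Other choices, such as oversampling the minority class or weighting pseudo-labels by class, would be equally plausible variants.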
