An unsupervised learning approach to resolving the data imbalanced issue in supervised learning problems in functional genomics

Abstract

Imbalanced data arise very frequently in functional genomic applications; a ratio of one positive example to thousands of negative instances is common in scientific applications. Unfortunately, traditional machine learning tends to treat the extremely rare minority instances as noise. The standard remedy is to balance the training data by resampling. However, this results in a high rate of false positive predictions. Hence, we propose preprocessing the majority instances by partitioning them into clusters. This greatly reduces the ambiguity between the minority instances and the instances in each cluster. For a moderately high imbalance ratio and low in-class complexity, our technique gives better prediction accuracy than the undersampling method. For an extreme imbalance ratio, as in the splice site prediction problem, we demonstrate that the technique serves as a good filter with almost perfect recall: it reduces the degree of imbalance so that traditional classification techniques can be deployed, yielding significant improvements over the previous predictor. We also show that the technique works for subcellular localization and post-translational modification site prediction problems.
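
The abstract does not say how the clusters are formed or how the per-cluster decisions are combined, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes plain k-means for partitioning the majority class, a nearest-centroid rule as a stand-in base classifier, and an "all per-cluster classifiers must vote positive" combination. The names `kmeans`, `train`, and `predict`, and the toy data, are hypothetical.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (assumed clustering step); returns non-empty clusters."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [centroid(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return [c for c in clusters if c]


def train(minority, majority, k):
    """Partition the majority class, then keep one centroid per cluster.

    Each (minority centroid, cluster centroid) pair acts as a tiny
    minority-vs-cluster classifier; this base learner is an assumption.
    """
    pos_center = centroid(minority)
    neg_centers = [centroid(c) for c in kmeans(majority, k)]
    return pos_center, neg_centers


def predict(model, x):
    """Positive only if every minority-vs-cluster classifier votes positive."""
    pos_center, neg_centers = model
    d_pos = math.dist(x, pos_center)
    return all(d_pos < math.dist(x, c) for c in neg_centers)


# Toy data (hypothetical): two majority groups and a small minority group.
offsets = [(-0.4, 0.2), (0.3, -0.3), (0.1, 0.4), (-0.2, -0.1)]
majority = [[cx + dx, 5 + dy] for cx in (-5, 5) for dx, dy in offsets]
minority = [[0.0, 0.1], [0.2, -0.1]]

model = train(minority, majority, k=2)
print(predict(model, [0.0, 0.0]), predict(model, [5.0, 5.0]))
```

In this sketch each sub-problem pits the minority class against only one compact group of majority instances, which is the sense in which clustering can reduce ambiguity. Any standard classifier could replace the nearest-centroid rule.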