Journal of biomedical informatics

Classification and knowledge discovery in protein databases.



Abstract

We consider the problem of classification in noisy, high-dimensional, and class-imbalanced protein datasets. In order to design a complete classification system, we use a three-stage machine learning framework consisting of a feature selection stage, a method addressing noise and class-imbalance, and a method for combining biologically related tasks through a prior-knowledge based clustering. In the first stage, we employ Fisher's permutation test as a feature selection filter. Comparisons with the alternative criteria show that it may be favorable for typical protein datasets. In the second stage, noise and class imbalance are addressed by using minority class over-sampling, majority class under-sampling, and ensemble learning. The performance of logistic regression models, decision trees, and neural networks is systematically evaluated. The experimental results show that in many cases ensembles of logistic regression classifiers may outperform more expressive models due to their robustness to noise and low sample density in a high-dimensional feature space. However, ensembles of neural networks may be the best solution for large datasets. In the third stage, we use prior knowledge to partition unlabeled data such that the class distributions among non-overlapping clusters significantly differ. In our experiments, training classifiers specialized to the class distributions of each cluster resulted in a further decrease in classification error.
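The first stage described above, a permutation test used as a feature selection filter, can be illustrated with a minimal sketch. The abstract does not state the test statistic, the number of permutations, or the selection threshold, so everything here is an assumption: the statistic is taken to be the absolute difference in class means, and the names `permutation_p_value`, `select_features`, `alpha`, and `n_permutations` are hypothetical. It is not the authors' implementation.

```python
import random


def permutation_p_value(values, labels, n_permutations=1000, seed=0):
    """Estimate a two-sided permutation p-value for one feature.

    values: feature values, one per sample.
    labels: 0/1 class labels, same length as values (both classes present).
    Assumed statistic: absolute difference between the two class means.
    """
    rng = random.Random(seed)

    def mean_difference(current_labels):
        positives = [v for v, y in zip(values, current_labels) if y == 1]
        negatives = [v for v, y in zip(values, current_labels) if y == 0]
        return abs(sum(positives) / len(positives)
                   - sum(negatives) / len(negatives))

    observed = mean_difference(labels)
    shuffled = list(labels)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Shuffling labels breaks any link between feature and class
        # while keeping the class sizes fixed.
        rng.shuffle(shuffled)
        if mean_difference(shuffled) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


def select_features(rows, labels, alpha=0.05, **kwargs):
    """Return indices of feature columns whose p-value is below alpha.

    rows: list of samples, each a list of feature values.
    alpha: hypothetical cut-off, not taken from the paper.
    """
    selected = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        if permutation_p_value(column, labels, **kwargs) < alpha:
            selected.append(j)
    return selected
```

Because the filter scores each feature independently of any classifier, it can be run once before the sampling and ensemble stage. In a high-dimensional setting a multiple-testing correction or a top-k ranking by p-value may be preferable to a fixed `alpha`.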
