Machine Learning and Applications and Workshops (ICMLA), 2011 10th International Conference on

Predicting 'Essential' Genes across Microbial Genomes: A Machine Learning Approach

Abstract

Essential genes constitute the minimal set of genes an organism needs for its survival. Identification of essential genes is of theoretical interest to genome biologists and has practical applications in medicine and biotechnology. This paper presents and evaluates machine learning approaches to the problem of predicting essential genes in microbial genomes using solely sequence-derived input features. We investigate three different supervised classification methods -- Support Vector Machine (SVM), Artificial Neural Network (ANN), and Decision Tree (DT) -- for this binary classification task. The classifiers are trained and evaluated using 37830 examples obtained from 14 experimentally validated, taxonomically diverse microbial genomes whose essential genes are known. A set of 52 relevant features derived from the genomic sequence is used as input for the classifiers. The models were evaluated using two novel blind-testing schemes, Leave-One-Genome-Out (LOGO) and Leave-One-Taxon-group-Out (LOTO), and a 10-fold stratified cross-validation (10-f-cv) strategy, on both the full multi-genome dataset and its class-imbalance-reduced variants. Experimental results (10 x 10-f-cv) indicate that SVM and ANN perform better than DT, with Area Under the Receiver Operating Characteristic curve (AU-ROC) scores of 0.80, 0.79 and 0.68 respectively. This study demonstrates that supervised machine learning methods can be used to predict essential genes in microbial genomes using only the gene sequence and features derived from it. The LOGO and LOTO blind-test results suggest that the trained classifiers generalize across genomes and taxonomic boundaries.
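The Leave-One-Genome-Out scheme named in the abstract can be illustrated with a minimal sketch: every gene from one genome is held out for testing, a classifier is trained on the remaining genomes, and AU-ROC is computed on the held-out genome. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the toy two-feature data, and the nearest-centroid scorer, which is only a stand-in for the SVM, ANN, and DT classifiers the authors actually used.

```python
from collections import defaultdict


def leave_one_genome_out(genome_ids):
    """Yield (held_out_genome, train_idx, test_idx), one split per genome."""
    by_genome = defaultdict(list)
    for i, g in enumerate(genome_ids):
        by_genome[g].append(i)
    for held_out in sorted(by_genome):
        test_idx = by_genome[held_out]
        train_idx = sorted(
            i for g, idxs in by_genome.items() if g != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


def au_roc(labels, scores):
    """AU-ROC as the probability that a random positive outscores a random
    negative, counting ties as one half (quadratic time, fine for a sketch)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AU-ROC needs at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_centroid_scorer(X, y):
    """Hypothetical stand-in classifier (NOT the paper's SVM/ANN/DT):
    score a gene by projecting its features onto (mean_pos - mean_neg)."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    dim = len(X[0])
    mu_pos = [sum(x[j] for x in pos) / len(pos) for j in range(dim)]
    mu_neg = [sum(x[j] for x in neg) / len(neg) for j in range(dim)]
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    return lambda x: sum(wj * xj for wj, xj in zip(w, x))


def logo_evaluate(X, y, genome_ids):
    """Train on all genomes but one, test on the held-out genome, repeat."""
    results = {}
    for held_out, train_idx, test_idx in leave_one_genome_out(genome_ids):
        score = fit_centroid_scorer(
            [X[i] for i in train_idx], [y[i] for i in train_idx]
        )
        results[held_out] = au_roc(
            [y[i] for i in test_idx], [score(X[i]) for i in test_idx]
        )
    return results


# Toy data (invented): 3 genomes x 4 genes, 2 features, 1 = essential.
GENOMES = ["g1"] * 4 + ["g2"] * 4 + ["g3"] * 4
LABELS = [1, 1, 0, 0] * 3
FEATURES = [
    [0.9, 0.8], [0.8, 0.7], [0.2, 0.1], [0.1, 0.3],
    [0.7, 0.9], [0.6, 0.8], [0.3, 0.2], [0.2, 0.2],
    [0.8, 0.6], [0.9, 0.9], [0.1, 0.1], [0.3, 0.4],
]

if __name__ == "__main__":
    for genome, auc in logo_evaluate(FEATURES, LABELS, GENOMES).items():
        print(genome, auc)
```

Leave-One-Taxon-group-Out would follow the same pattern if a taxon-group label were passed in place of the genome label. Because no gene from the held-out group is seen during training, the resulting scores estimate generalization to unseen genomes rather than to unseen genes within known genomes.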