Journal: EURASIP Journal on Bioinformatics and Systems Biology

A top-down approach to classify enzyme functional classes and sub-classes using random forest

Abstract

Advances in sequencing technologies have led to an exponential rise in the number of newly discovered enzymes. Enzymes are proteins that catalyze biochemical reactions and play an important role in metabolic pathways. The function of such enzymes is commonly determined by experiments, which can be time-consuming and costly. There is therefore a need for a computational method that can distinguish enzyme sequences from non-enzyme sequences and reliably predict the function of the former. To address this problem, approaches that cluster enzymes based on sequence and structural similarity have been proposed. However, these approaches are known to fail for proteins that perform the same function but are dissimilar in sequence and structure. In this article, we present a supervised machine learning model that predicts the functional class and sub-class of enzymes from a set of 73 sequence-derived features. The functional classes are those defined by the International Union of Biochemistry and Molecular Biology. Using random forest, an efficient data mining algorithm, we construct a top-down three-layer model in which the top layer classifies a query protein sequence as an enzyme or non-enzyme, the second layer predicts the main functional class, and the bottom layer predicts the sub-class. The model achieved an overall classification accuracy of 94.87% at the first level, 87.7% at the second, and 84.25% at the bottom level. Our results compare very well with existing methods and in many cases show better performance. Using feature selection methods, we show the biological relevance of a few of the top-ranked attributes.
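
The top-down, three-layer routing described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class and field names (`TopDownEnzymeModel`, `Prediction`, `NON_ENZYME`) are hypothetical, the classifiers are left as plain callables rather than trained random forests, and the use of one sub-class model per main class is an assumption that the abstract does not confirm.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Sequence

# A classifier is any callable that maps a feature vector to a label.
# In the setting of the abstract each one would be a trained random forest;
# they are kept abstract here so the routing logic stands on its own.
Classifier = Callable[[Sequence[float]], str]

NON_ENZYME = "non-enzyme"  # hypothetical label name, chosen for illustration


@dataclass
class Prediction:
    is_enzyme: bool
    main_class: Optional[str] = None
    sub_class: Optional[str] = None


@dataclass
class TopDownEnzymeModel:
    enzyme_filter: Classifier       # layer 1: enzyme vs. non-enzyme
    main_class_model: Classifier    # layer 2: main functional class
    # Layer 3: assumed here to be one sub-class model per main class.
    sub_class_models: Dict[str, Classifier] = field(default_factory=dict)

    def predict(self, features: Sequence[float]) -> Prediction:
        # Layer 1: stop early if the sequence is not predicted to be an enzyme.
        if self.enzyme_filter(features) == NON_ENZYME:
            return Prediction(is_enzyme=False)
        # Layer 2: predict the main functional class.
        main = self.main_class_model(features)
        # Layer 3: refine with the sub-class model for that main class, if any.
        sub_model = self.sub_class_models.get(main)
        sub = sub_model(features) if sub_model is not None else None
        return Prediction(is_enzyme=True, main_class=main, sub_class=sub)
```

Each lower layer is consulted only when the layer above routes the query to it, so an error at the top layer propagates downward. The lower-layer models, in turn, only need to separate the classes that can actually reach them.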
