Journal: In silico biology: An international on computational biology

Protein Subcellular Localization Prediction Using a Hybrid of Similarity Search and Error-Correcting Output Code Techniques That Produces Interpretable Results



Abstract

In silico prediction of protein subcellular localization from amino acid sequence can reveal valuable information about a protein's innate roles in the cell. Unfortunately, complex protein sorting signals make such prediction difficult. Some prediction methods search for similar proteins with known localization, which assumes that known homologs exist; these methods may not perform well on proteins with no known homolog. In contrast, machine learning-based approaches attempt to infer a predictive model that describes the protein sorting signals. In doing so, however, they do not take advantage of known homologs (if they exist) through a simple "table lookup". Here, we capture the best of both worlds by combining the two approaches. On a dataset with 12 locations, the similarity-based and machine learning approaches independently achieve accuracies of 83.8% and 72.6%, respectively, while our hybrid approach yields an improved accuracy of 85.9%. We compared our method with the published results of three other methods. For two of the methods, we used their published datasets for comparison; for the third, we used the 12-location dataset. The Error-Correcting Output Code algorithm was used to construct our predictive model. This algorithm gives attention to all classes regardless of their number of instances, which led to high accuracy within each class and a high prediction rate overall. We also illustrate how the machine learning classifier we use, built over a meaningful set of features, can produce interpretable rules that may provide valuable insights into complex protein sorting mechanisms.
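
The hybrid idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how the two approaches are combined, so the rule used here (trust the best similarity hit when its score clears a threshold, otherwise fall back to error-correcting output code decoding) is an assumption. The location names, the 7-bit codebook, the `min_score` threshold, and all function names are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical codebook: one binary codeword per location.
# Any two codewords differ in 4 positions, so a single wrong
# binary classifier output can still be corrected.
CODEBOOK: Dict[str, Tuple[int, ...]] = {
    "nucleus":       (0, 0, 0, 0, 0, 0, 0),
    "cytoplasm":     (1, 0, 1, 0, 1, 0, 1),
    "mitochondrion": (0, 1, 1, 0, 0, 1, 1),
    "extracellular": (1, 1, 0, 0, 1, 1, 0),
}


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def ecoc_decode(bits: Sequence[int],
                codebook: Dict[str, Tuple[int, ...]] = CODEBOOK) -> str:
    """Return the class whose codeword is closest to the predicted bits."""
    return min(codebook, key=lambda label: hamming(codebook[label], bits))


def hybrid_predict(hits: List[Tuple[str, float]],
                   bits: Sequence[int],
                   min_score: float = 0.5) -> Tuple[str, str]:
    """Combine a similarity lookup with ECOC decoding (assumed rule).

    hits: (location, similarity score in [0, 1]) pairs for proteins
          of known localization; may be empty.
    bits: outputs of the binary classifiers, one per codeword column.
    Returns the predicted location and which route produced it.
    """
    if hits:
        label, score = max(hits, key=lambda hit: hit[1])
        if score >= min_score:
            return label, "similarity"  # the "table lookup" route
    return ecoc_decode(bits), "ecoc"    # no usable homolog: use the model


# Classifier outputs with one flipped bit relative to "cytoplasm".
noisy_bits = (1, 0, 1, 0, 1, 0, 0)
print(hybrid_predict([("nucleus", 0.9)], noisy_bits))
print(hybrid_predict([], noisy_bits))
```

In this sketch, a strong homolog decides the prediction directly, while a protein with no hit (or only weak hits) is assigned the location whose codeword is nearest in Hamming distance to the classifier outputs, which tolerates one erroneous bit with this codebook.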
