...
Journal: BMC Systems Biology

Identifying the missing proteins in human proteome by biological language model



Abstract

Background: With the rapid development of high-throughput sequencing technology, proteomics has become an active field in the post-genomic era. Identifying all natively encoded protein sequences is necessary for further function and pathway analysis. To that end, the Human Proteome Organization launched the Human Proteome Project in 2011. However, many proteins are hard to detect by experimental methods, which has become one of the bottlenecks of the Human Proteome Project. Given the difficulty of detecting these missing proteins with wet-lab experiments, we use a bioinformatics method to pre-filter the missing proteins.

Results: Because biological sequences are analogous to natural language, n-gram models from the field of Natural Language Processing were used to filter the missing proteins. The dataset used in this study contains 616 missing proteins from the “uncertain” category of the neXtProt database. The n-gram model identifies 102 proteins that have a high probability of being native human proteins. We perform a detailed analysis of the predicted structure and function of these missing proteins and also compare the high-probability proteins with other mass spectrometry datasets. The evaluation shows that the results reported here are in good agreement with those obtained from other well-established databases.

Conclusion: The analysis shows that 102 proteins may be native gene-coding proteins and that some of the missing proteins are membrane or natively disordered proteins, which are hard to detect by experimental methods.
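The n-gram idea mentioned in the Results can be illustrated with a minimal sketch. The abstract does not state the value of n, the smoothing scheme, the training set, or the decision threshold, so everything below is an assumption chosen for illustration: a trigram model over the 20 standard amino acids with add-one smoothing, trained on made-up toy sequences. The class name `NGramModel` and the method `avg_log_prob` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


class NGramModel:
    """Residue-level n-gram model with add-one (Laplace) smoothing.

    Illustrative only: n=3 and the smoothing scheme are assumptions,
    not parameters reported in the abstract.
    """

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = Counter()    # counts of full n-grams
        self.context_counts = Counter()  # counts of the (n-1)-residue prefixes

    def train(self, sequences):
        """Count n-grams over a collection of known protein sequences."""
        for seq in sequences:
            for i in range(len(seq) - self.n + 1):
                gram = seq[i:i + self.n]
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1

    def avg_log_prob(self, seq):
        """Average smoothed log-probability per n-gram in seq.

        Returns -inf when seq is shorter than n.
        """
        vocab = len(AMINO_ACIDS)
        total, count = 0.0, 0
        for i in range(len(seq) - self.n + 1):
            gram = seq[i:i + self.n]
            numerator = self.ngram_counts[gram] + 1
            denominator = self.context_counts[gram[:-1]] + vocab
            total += math.log(numerator / denominator)
            count += 1
        return total / count if count else float("-inf")


# Toy training data: made-up sequences, not real proteins.
known = ["MKTAYIAKQR", "MKTAYLAKQR"]
model = NGramModel(n=3)
model.train(known)

familiar = model.avg_log_prob("MKTAYIAKQR")    # resembles the training data
unfamiliar = model.avg_log_prob("WWWWWWWWWW")  # shares no n-grams with it
print(familiar > unfamiliar)
```

In a filter of this kind, a candidate sequence whose score exceeds some cutoff would be kept as likely native; how such a cutoff is chosen is not described in the abstract and is left open here.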
