
Prediction of Protein Function with a Probabilistic Model for Analysis of Sequence Similarity Networks and Genomic Context



Abstract

The number of known protein sequences is growing faster than the number of curated protein functions. To help bridge this gap, bioinformatics scientists have created automated methods for the prediction of protein function. Recently, the focus has been on integrating numerous data sources, and critical evaluation of these methods shows that the integrative approach improves predictive performance. However, a basic BLAST-based method is still a top contender.

Computational biologists often use two complementary approaches that usually infer functions more accurately than a BLAST-based method. The first, analysis of sequence similarity networks, can dissect protein functions in a superfamily and infer the function of individual proteins. Briefly, a computational biologist will create a network of proteins in sequence space, which typically shows clusters of similar proteins. She will then highlight the few proteins that have experimental functional annotations, and paint the network according to other functional features that are broadly available, such as residues in key positions in an alignment. These data are used to identify proteins where a functional change may have occurred, which can then be used to delineate protein families or other protein groups that share a specific function or functional characteristic. However, molecular functional annotation data are very scarce, and there are too few to draw functional boundaries with high confidence.

The second method, analysis of genomic context, is often done in conjunction with sequence similarity network analysis. This approach uses data about the genome neighbors of a protein, or more generally, any functional association data, such as protein-protein interaction data, to predict a protein's molecular function. This technique has been used to refine functional boundaries during sequence similarity network analysis, as well as to generate hypotheses in the absence of characterization of any close homologs.

In this dissertation, I describe Effusion, our attempt to automate sequence similarity network analysis and improve on the current methods for the prediction of protein function. Effusion modernizes the classical BLAST-based approach while avoiding pitfalls common to state-of-the-art methods. It uses a sequence similarity network to add context for homology transfer, a probabilistic model to account for the uncertainty in labels and function propagation, and the structure of the Gene Ontology to best utilize sparse input labels and make consistent output predictions. Effusion's model makes it practical to integrate rare experimental data with the abundant primary sequence and sequence similarity data. Our model allows for inference with general-purpose, state-of-the-art inference algorithms, makes use of all experimental annotation data, has parameters specific to each Gene Ontology term, and adds data-derived pseudocounts to predict rare terms.

Effusion GCA extends Effusion by integrating the chief components necessary for automating genomic context analysis. It performs its analysis over a combined sequence similarity and functional association network, with a model of protein function that includes a representation of each protein's biological process, performs simultaneous inference on multiple aspects of protein function, and only propagates functional information where it is appropriate.

We assessed our methods using a critical evaluation method and metrics. The results show that Effusion outperforms standard prediction methods, the most similar prediction methods, and state-of-the-art prediction methods. Effusion GCA does not perform as well as Effusion in aggregate, but offers several other insights. We conclude that these methods represent significant progress in the field of protein function prediction and clearly suggest avenues for further advances.
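As a rough illustration of the manual workflow the abstract describes (build a network in sequence space, find clusters, transfer the few experimental labels to neighbors, and keep predictions consistent with an ontology), the following Python sketch may help. It is not Effusion's probabilistic model: it swaps in a plain neighbor-vote heuristic, and the protein names, similarity scores, threshold, and toy ontology are all hypothetical values invented for the example.

```python
def build_network(pairs, threshold):
    """Return an undirected graph with an edge for each pair scoring >= threshold."""
    graph = {}
    for a, b, score in pairs:
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def clusters(graph):
    """Connected components of the network, found with an iterative depth-first search."""
    seen, result = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(graph[node] - component)
        seen |= component
        result.append(component)
    return result


def with_ancestors(term, parents):
    """A term plus all of its ancestors in the (toy) ontology."""
    closed, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in closed:
            closed.add(t)
            stack.extend(parents.get(t, ()))
    return closed


def transfer_labels(graph, labels, parents):
    """Score each term for an unlabeled protein by the fraction of its
    labeled neighbors that carry the term or one of its descendants."""
    predictions = {}
    for protein, neighbors in graph.items():
        labeled = [n for n in neighbors if n in labels]
        if protein in labels or not labeled:
            continue
        counts = {}
        for n in labeled:
            terms = set()
            for term in labels[n]:
                terms |= with_ancestors(term, parents)
            for t in terms:
                counts[t] = counts.get(t, 0) + 1
        predictions[protein] = {t: c / len(labeled) for t, c in counts.items()}
    return predictions


# Hypothetical inputs: pairwise similarity scores, sparse labels, toy ontology.
pairs = [("P1", "P2", 0.9), ("P2", "P3", 0.8), ("P3", "P4", 0.2), ("P4", "P5", 0.7)]
parents = {"protein_kinase": ["kinase"], "kinase": ["root"], "hydrolase": ["root"]}
labels = {"P1": {"kinase"}, "P3": {"protein_kinase"}, "P5": {"hydrolase"}}

graph = build_network(pairs, threshold=0.5)
predictions = transfer_labels(graph, labels, parents)
```

Because every annotation is expanded to its ancestors before voting, a parent term can never score lower than its child, which is one simple way to keep output predictions consistent with the ontology. A probabilistic model such as the one the abstract describes would additionally account for uncertainty in the labels and in propagation, which this sketch does not attempt.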

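Bibliographic record

  • Author

    Yunes, Jeffrey Michael

  • Author affiliation

    University of California, San Francisco

  • Degree-granting institution: University of California, San Francisco
  • Discipline: Bioinformatics
  • Degree: Ph.D.
  • Year: 2018
  • Pages: 156 p.
  • Total pages: 156
  • Original format: PDF
  • Language of text: English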
