
Automatic Identification of Highly Conserved Family Regions and Relationships in Genome Wide Datasets Including Remote Protein Sequences


Abstract

Identifying shared sequence segments along amino acid sequences generally requires a collection of closely related proteins, most often curated manually from the sequence datasets to suit the purpose at hand. Currently developed statistical methods are strained, however, when the collection contains remote sequences with poor alignment to the rest, or sequences containing multiple domains. In this paper, we propose a completely unsupervised and automated method to identify the shared sequence segments observed in a diverse collection of protein sequences including those present in a smaller fraction of the sequences in the collection, using a combination of sequence alignment, residue conservation scoring and graph-theoretical approaches. Since shared sequence fragments often imply conserved functional or structural attributes, the method produces a table of associations between the sequences and the identified conserved regions that can reveal previously unknown protein families as well as new members to existing ones. We evaluated the biological relevance of the method by clustering the proteins in gold standard datasets and assessing the clustering performance in comparison with previous methods from the literature. We have then applied the proposed method to a genome wide dataset of 17793 human proteins and generated a global association map to each of the 4753 identified conserved regions. Investigations on the major conserved regions revealed that they corresponded strongly to annotated structural domains. This suggests that the method can be useful in predicting novel domains on protein sequences.
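The abstract names the ingredients of the method (sequence alignment, residue conservation scoring, and a sequence-to-region association table) but gives no formulas. The following is only a minimal sketch of that general idea, not the authors' algorithm. It assumes a toy pre-aligned set of sequences, a simple majority-residue conservation score, and hypothetical thresholds (`threshold`, `min_len`, `min_identity`). The graph-theoretical step is not shown.

```python
from collections import Counter

def column_scores(alignment):
    """Score each column as the share of sequences carrying the most common non-gap residue."""
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores

def conserved_regions(scores, threshold=0.75, min_len=3):
    """Return half-open (start, end) runs of columns scoring at or above the threshold."""
    regions, start = [], None
    for i, score in enumerate(scores + [0.0]):  # sentinel closes a trailing run
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    return regions

def association_table(alignment, regions, min_identity=0.8):
    """Map each region to the indices of sequences that match its consensus closely enough."""
    table = {}
    for start, end in regions:
        segments = [seq[start:end] for seq in alignment]
        consensus = [Counter(col).most_common(1)[0][0] for col in zip(*segments)]
        members = []
        for idx, segment in enumerate(segments):
            matches = sum(a == b for a, b in zip(segment, consensus))
            if matches / len(consensus) >= min_identity:
                members.append(idx)
        table[(start, end)] = members
    return table

# Toy alignment (invented for illustration): the last sequence lacks the
# first region, and the third has a gap inside the second one.
toy_alignment = [
    "MKVLAAGTW",
    "MKVLSAGTW",
    "MKVLQ-GTW",
    "PQRSTAGTW",
]
toy_regions = conserved_regions(column_scores(toy_alignment))
toy_table = association_table(toy_alignment, toy_regions)
```

In this toy case each of the two regions is associated with a different subset of the sequences, which mirrors the abstract's point that a conserved region may be present in only part of the collection.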

Bibliographic details

  • Authors: Tunca Doğan; Bilge Karaçalı
  • Volume (issue): 8(9)
  • Pages: e75458
  • Total pages: 15
  • Original format: PDF
