A structural perspective on genome evolution

Abstract

At UCL we have developed several automated protocols for generating protein family resources (CATH; Gene3D). These resources can be used to perform comparative genome analyses in order to understand the evolution of protein families. They can also be used to identify biologically and/or medically interesting families for which no structural data currently exist and which may therefore be important targets for structural genomics initiatives.

The CATH domain structure database, established by Orengo and Thornton in 1993, now contains a significant proportion of protein structures from the PDB clustered into 1400 evolutionary families. Relationships have been identified using robust structure comparison methods (SSAP, CATHEDRAL). We have also benchmarked and optimised various 1D-profile and HMM-based protocols for assigning genome sequences to families within the resource (e.g. SAM-T99, SAMOSA, CATH-ISL).

In this way we can assign structural data to a large proportion (up to 60%) of whole or partial sequences in completed genomes and 80% of genes coding for enzymes and other proteins in biochemical pathways. However, in order to include all families regardless of whether their structure is known or not, a new protein family resource has been developed (Gene3D). In Gene3D, complete genes have been clustered according to sequence similarity alone, using a robust clustering method (Pfscape). 120 completed genomes from all kingdoms have been clustered into 220,000 gene families, 70,000 of which contain 2 or more sequences. Subsequently, we have labelled those gene families for which CATH structural or Pfam functional domain annotations can be provided for all or part of the gene.

Preliminary analysis of the genome annotations reveals that a significant proportion (up to 70%) of CATH-annotated genes or gene regions in genomes are assigned to domain families that are common to all three kingdoms of life. However, only 20% of the genome sequences are assigned to gene families common to all kingdoms. Since a large proportion of these genes are multidomain proteins, this supports the view that a great deal of functional diversity within the genomes has been achieved by combining domain modules in different ways.

In collaboration with Professor Janet Thornton, we have analysed a subset of 56 bacterial genomes to determine the recurrence of specific domain structure families within the genomes. This revealed a small but essential group of universal and, in some cases, highly recurring domain families. For some size-dependent families, domain recurrence is highly correlated with increase in genome size, whilst in other size-independent families no correlation is observed. Statistical analysis allowed us to distinguish three groups. Within the size-dependent families we differentiated two groups: linearly-distributed and non-linearly-distributed. Functional annotation using the COGs revealed that the linearly-distributed and non-linearly-distributed domains were predominantly involved in metabolism and regulation, respectively. A third group of evenly-distributed, size-independent domains is primarily involved in protein translation and biosynthesis.

By mapping CATH and Pfam domain families onto all the genome sequences in Gene3D, we observe that a few hundred highly recurrent families account for at least 50% of whole or partial genome sequences. Many of these families are common to both prokaryotes and eukaryotes and perform essential generic functions. In many of the largest families, significant divergence in sequence has been accompanied by modifications in structure and function. Targeting representatives in these families for structure determination will allow the structural genomics initiatives to map both fold and function space and reveal the mechanisms by which divergence in protein families promotes evolution of new functions.
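The three-way grouping of domain families by how their recurrence scales with genome size can be illustrated with a minimal sketch. The abstract does not state which statistical test was used, so everything here is an assumption made for illustration: the function names, the Pearson-correlation cutoff `r_min`, the use of a log-log slope to separate linear from non-linear scaling, the tolerance `slope_tol`, and the toy counts are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x); all values must be > 0."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((x - mx) ** 2 for x in lx)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    return sxy / sxx


def classify_family(genome_sizes, domain_counts, r_min=0.7, slope_tol=0.3):
    """Assign one domain family to one of three illustrative groups.

    genome_sizes  -- genes per genome (one value per genome)
    domain_counts -- copies of the family's domain in each genome
    r_min, slope_tol -- hypothetical thresholds, not taken from the abstract
    """
    if pearson(genome_sizes, domain_counts) < r_min:
        return "evenly-distributed (size-independent)"
    # Keep only genomes where the family occurs, so the logarithm is defined.
    pairs = [(s, c) for s, c in zip(genome_sizes, domain_counts) if c > 0]
    if len({s for s, _ in pairs}) < 2:
        return "evenly-distributed (size-independent)"
    slope = loglog_slope([s for s, _ in pairs], [c for _, c in pairs])
    if abs(slope - 1.0) <= slope_tol:
        return "linearly-distributed (size-dependent)"
    return "non-linearly-distributed (size-dependent)"


# Toy data: five genomes of increasing size (hypothetical gene counts).
sizes = [500, 1000, 2000, 4000, 8000]
metabolic = [5, 10, 20, 40, 80]        # count doubles when size doubles
regulatory = [1, 4, 16, 64, 256]       # count quadruples when size doubles
translation = [3, 2, 3, 2, 3]          # roughly constant

for name, counts in [("metabolic", metabolic),
                     ("regulatory", regulatory),
                     ("translation", translation)]:
    print(name, "->", classify_family(sizes, counts))
```

A log-log slope close to 1 means the domain count grows in proportion to genome size, while a clearly different slope indicates non-linear growth. Real data would need a proper significance test and the actual per-genome domain assignments from CATH and Gene3D.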
