Unique function words characterize genomic proteins

Abstract

Between 2009 and 2016 the number of protein sequences from known species increased 10-fold from 8 million to 85 million. About 80% of these sequences contain at least one region recognized by the conserved domain architecture retrieval tool (CDART) as a sequence motif. Motifs provide clues to biological function but CDART often matches the same region of a protein by two or more profiles. Such synonyms complicate estimates of functional complexity. We do full-linkage clustering of redundant profiles by finding maximum disjoint cliques: Each cluster is replaced by a single representative profile to give what we term a unique function word (UFW). From 2009 to 2016, the number of sequence profiles used by CDART increased by 80%; the number of UFWs increased more slowly by 30%, indicating that the number of UFWs may be saturating. The number of sequences matched by a single UFW (sequences with single domain architectures) increased as slowly as the number of different words, whereas the number of sequences matched by a combination of two or more UFWs in sequences with multiple domain architectures (MDAs) increased at the same rate as the total number of sequences. This combinatorial arrangement of a limited number of UFWs in MDAs accounts for the genomic diversity of protein sequences. Although eukaryotes and prokaryotes use very similar sets of “words” or UFWs (57% shared), the “sentences” (MDAs) are different (1.3% shared).
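The clustering step can be illustrated with a small sketch. The abstract does not give the exact procedure, so everything here is an assumption: redundancy is modeled as a graph whose edges join profiles that match the same protein region, disjoint cliques are extracted greedily (largest first), and the representative is the profile with the most matched sequences. The names `disjoint_clique_clusters`, `representatives`, and `hits` are hypothetical and do not come from CDART or the paper.

```python
def maximal_cliques(adj, nodes):
    """Enumerate maximal cliques (Bron-Kerbosch) in the subgraph induced by `nodes`."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in sorted(p):
            nv = adj[v] & nodes
            expand(r | {v}, p & nv, x & nv)
            p = p - {v}
            x = x | {v}

    expand(set(), set(nodes), set())
    return cliques


def disjoint_clique_clusters(profiles, redundant_pairs):
    """Greedy sketch: repeatedly remove the largest clique among remaining profiles.

    Every pair inside a cluster is redundant (full linkage), and no profile
    ends up in two clusters (disjoint).
    """
    adj = {p: set() for p in profiles}
    for a, b in redundant_pairs:
        adj[a].add(b)
        adj[b].add(a)

    remaining = set(profiles)
    clusters = []
    while remaining:
        # Largest clique first; ties broken by member names for determinism.
        best = min(maximal_cliques(adj, remaining),
                   key=lambda c: (-len(c), sorted(c)))
        clusters.append(sorted(best))
        remaining -= best
    return clusters


def representatives(clusters, hits):
    """Assumed rule: keep the profile with the most matched sequences per cluster."""
    return [min(c, key=lambda p: (-hits.get(p, 0), p)) for c in clusters]


# Toy data: A, B, C are mutually redundant; D overlaps only C; E overlaps nothing.
profiles = ["A", "B", "C", "D", "E"]
pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
hits = {"A": 10, "B": 50, "C": 20, "D": 5, "E": 7}

clusters = disjoint_clique_clusters(profiles, pairs)
ufws = representatives(clusters, hits)
```

In this toy case the five profiles reduce to three clusters. D stays separate from A, B, and C because it is redundant with C only; single-linkage clustering would have merged it, whereas full linkage requires every pair in a cluster to be redundant.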