Frontiers in Genetics

The use of semantic similarity measures for optimally integrating heterogeneous Gene Ontology data from large scale annotation pipelines

Abstract

With the advancement of new high-throughput sequencing technologies, the number of genome sequencing projects worldwide has increased, yielding complete genome sequences of humans, animals, and plants. Subsequently, several labs have focused on genome annotation, which consists of assigning functions to gene products, mostly using Gene Ontology (GO) terms. As a consequence, annotations across genomes have become more heterogeneous, both because different pipelines use different approaches to infer these annotations and because of the nature of the GO structure itself. This makes a curator's task difficult, even when they adhere to the established guidelines for assessing these protein annotations. Here we develop a genome-scale approach for integrating GO annotations from different pipelines using semantic similarity measures. We used this approach to identify inconsistencies and similarities in functional annotations between orthologs of human and Drosophila melanogaster; to assess the quality of GO annotations derived from InterPro2GO mappings against manually curated GO annotations for the Drosophila melanogaster proteome (from a FlyBase dataset) and the human proteome; and to filter GO annotation data for these proteomes. The results indicate that efficient integration of GO annotations eliminates redundancy of up to 27.08% and 22.32% in the Drosophila melanogaster and human GO annotation datasets, respectively. Furthermore, we identified missing annotations for some orthologs, as well as annotation mismatches between the InterPro2GO and manual pipelines in these two proteomes, which therefore require further curation. This approach simplifies the task of curators in assessing protein annotations, reduces redundancy, and eliminates inconsistencies in large annotation datasets, easing comparative functional genomics.
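
The abstract does not say which semantic similarity measure or filtering rule the authors use, so the following is only a minimal sketch of the general idea under stated assumptions. It uses a toy ontology with hypothetical term IDs (not real GO terms), treats a term as redundant when it is an ancestor of another term annotating the same protein, and uses a simple Jaccard overlap of ancestor sets as a stand-in similarity measure. The function names `ancestors`, `remove_redundant`, and `term_similarity` are illustrative, not taken from the paper.

```python
# Toy GO-like DAG: child -> set of direct parents.
# Term IDs are hypothetical placeholders, not real GO identifiers.
PARENTS = {
    "T:root": set(),
    "T:binding": {"T:root"},
    "T:catalysis": {"T:root"},
    "T:dna_binding": {"T:binding"},
    "T:kinase": {"T:catalysis"},
}


def ancestors(term, parents=PARENTS):
    """Return all proper ancestors of `term` in the DAG."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def remove_redundant(terms, parents=PARENTS):
    """Drop every term that is an ancestor of another term in the set.

    Assumption: an ancestor adds no information once a more specific
    descendant already annotates the same protein.
    """
    terms = set(terms)
    implied = set()
    for term in terms:
        implied |= ancestors(term, parents)
    return terms - implied


def term_similarity(a, b, parents=PARENTS):
    """Jaccard overlap of the two terms' ancestor closures.

    A simple stand-in measure; the paper's actual measure may differ.
    """
    closure_a = ancestors(a, parents) | {a}
    closure_b = ancestors(b, parents) | {b}
    return len(closure_a & closure_b) / len(closure_a | closure_b)


# Annotations for one protein, merged from two hypothetical pipelines.
merged = {"T:binding", "T:dna_binding", "T:kinase", "T:root"}
filtered = remove_redundant(merged)
```

In this toy case `filtered` keeps only the two most specific terms, and `term_similarity` scores a term against its own parent higher than against a term on a different branch. A real pipeline would load the GO graph and annotation files instead of a hand-written dictionary.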
