
GARBASE---a database of wrongly annotated proteins.

Abstract

One of the many problems in publicly available sequence databases is the presence of wrongly annotated genes. These sequences and their associated annotations are used by computational methods that predict genes in newly sequenced genomes. Such gene prediction is based on homology to previously annotated genes. Because genes that were wrongly annotated in the past are also supported by homology, the wrong annotations are propagated continually, which affects any homology-based annotation of a newly sequenced genome. The objective of this project is to establish a balancing database of peptide sequences that have been called proteins, or parts of proteins, but have since been identified or reported not to be real. Here we report the development of a computational approach for collecting evidence that can be used to determine a confidence score for the annotation of a protein. Using this framework and two biological properties of proteins, namely the presence of a conserved domain and gene order conservation, we analyzed 85259 proteins from 26 Mycobacterium genomes. The results of this analysis populate a prototype database (GARBASE), which consists of 19484 proteins that are potentially annotated incorrectly. Additionally, work is underway to populate this database with the results of analyzing all the genomes available in public repositories such as GenBank. This will make GARBASE a useful resource when integrated into an automated genome annotation pipeline.

Keywords: Protein annotation, database, conserved domain, gene order, high performance computing.
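
The abstract does not say how the two kinds of evidence are combined into a confidence score. As a purely illustrative sketch, the Python below assumes a simple weighted sum of two boolean evidence flags and a fixed cutoff. The names `Evidence`, `confidence_score`, and `is_suspect`, the equal weights, and the 0.5 threshold are all assumptions made for this example and are not taken from the thesis. Only the two protein counts come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical per-protein evidence flags (not the thesis's schema)."""
    has_conserved_domain: bool
    gene_order_conserved: bool


def confidence_score(ev: Evidence, w_domain: float = 0.5, w_order: float = 0.5) -> float:
    """Weighted sum of supporting evidence; lies in [0, 1] when the weights sum to 1.

    The equal weights are an assumption for illustration only.
    """
    return w_domain * float(ev.has_conserved_domain) + w_order * float(ev.gene_order_conserved)


def is_suspect(ev: Evidence, threshold: float = 0.5) -> bool:
    """Flag a protein whose score falls below an assumed threshold."""
    return confidence_score(ev) < threshold


# Counts reported in the abstract.
ANALYZED = 85259
FLAGGED = 19484

# Share of analyzed proteins that ended up in the prototype database (roughly 23%).
flagged_fraction = FLAGGED / ANALYZED
```

With these assumed defaults, a protein is flagged only when neither kind of evidence supports it; a real scoring scheme could weight the evidence differently or draw on further sources.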

Bibliographic record

  • Author

    Pandey, Sanjit

  • Author affiliation

    University of Nebraska at Omaha

  • Degree-granting institution: University of Nebraska at Omaha
  • Subject: Bioinformatics; Computer science
  • Degree: M.S.
  • Year: 2010
  • Pages: 80 p.
  • Total pages: 80
  • Original format: PDF
  • Text language: eng
