
Evidence combination in hidden Markov models for gene prediction.


Abstract

This thesis introduces new techniques for finding genes in genomic sequences. Genes are regions of a genome that encode the proteins of an organism. Identifying the genes in a genome is an important step in the annotation process after a new genome is sequenced. The prediction accuracy of gene finding can be greatly improved by using experimental evidence. This evidence includes homologies between the genome and databases of known proteins, or evolutionary conservation of genomic sequence in different species.

We propose a flexible framework to incorporate several different sources of such evidence into a gene finder based on a hidden Markov model. Various sources of evidence are expressed as partial probabilistic statements about the annotation of positions in the sequence, and these are combined with the hidden Markov model to obtain the final gene prediction. The ability to use partial statements allows us to handle missing information transparently and to cope with the heterogeneous character of individual sources of evidence. On the other hand, this feature makes the combination step more difficult. We present a new method for combining partial probabilistic statements and prove that it is an extension of existing methods for combining complete probability statements. We evaluate the performance of our system and its individual components on data from the human and fruit fly genomes.

The use of sequence evolutionary conservation as a source of evidence in gene finding requires efficient and sensitive tools for finding similar regions in very long sequences. We present a method for improving the sensitivity of existing tools for this task by careful modeling of sequence properties. In particular, we build a hidden Markov model representing a typical homology between two protein coding regions and then use this model to optimize a component of a heuristic algorithm called a spaced seed. The seeds that we discover significantly improve the accuracy and running time of similarity search in protein coding regions, and are directly applicable to our gene finder.
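To make the notion of a spaced seed concrete, the following is a minimal Python sketch of seed-based hit detection. It is a generic illustration, not the thesis's method or its optimized seeds: the function names, the toy sequences, and the seed `110110` are all hypothetical choices made for this example. A seed is written as a string of `1`s (positions that must match) and `0`s (positions that are ignored); the example seed skips every third position, on the assumption that the third codon position varies most in protein coding regions.

```python
from collections import defaultdict


def seed_key(seq, start, seed):
    """Collect the letters of seq under the '1' positions of seed, starting at start."""
    return "".join(seq[start + i] for i, c in enumerate(seed) if c == "1")


def find_seed_hits(query, target, seed):
    """Return (query_pos, target_pos) pairs where the two sequences agree
    on every '1' position of the seed. Hypothetical helper for illustration."""
    # Index every window of the query by its spaced key.
    index = defaultdict(list)
    for q in range(len(query) - len(seed) + 1):
        index[seed_key(query, q, seed)].append(q)

    # Look up every window of the target in the index.
    hits = []
    for t in range(len(target) - len(seed) + 1):
        for q in index.get(seed_key(target, t, seed), []):
            hits.append((q, t))
    return hits


# Toy sequences that differ only at every third position (indices 2, 5, 8).
query = "ATGGCTGAA"
target = "ATAGCCGAG"

spaced_hits = find_seed_hits(query, target, "110110")
contiguous_hits = find_seed_hits(query, target, "1111")
print(spaced_hits)      # [(0, 0), (3, 3)]
print(contiguous_hits)  # []
```

Both seeds require four matching positions, but only the spaced seed finds the similarity here, because the mismatches fall on the positions it ignores. In a real similarity search, each hit would then be extended into a full alignment; that step is omitted from this sketch.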

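Bibliographic record

  • Author

    Brejova, Bronislava

  • Author affiliation

    University of Waterloo (Canada)

  • Degree-granting institution: University of Waterloo (Canada)
  • Discipline: Computer Science
  • Degree: Ph.D.
  • Year: 2006
  • Pages: 154 p.
  • Total pages: 154
  • Original format: PDF
  • Language of text: English (eng)
  • Chinese Library Classification: Automation technology, computer technology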