Journal of Bioinformatics and Computational Biology

MULTIVARIATE ENTROPY DISTANCE METHOD FOR PROKARYOTIC GENE IDENTIFICATION


Abstract

A new, simple method is presented for the efficient and accurate identification of coding sequences in prokaryotic genomes. The method employs a Shannon description of artificial language for DNA sequences: a DNA sequence is translated into a pseudo-amino acid sequence with 20 fundamental words according to the universal genetic code. Using an entropy-density profile (EDP), the method maps a sequence of finite length to a vector and then analyzes its position in the 20-dimensional phase space, which depends on the nature of the sequence. The ratio of the relative distances to averaged coding and non-coding EDPs, taken over a small number (up to one) of open reading frames (ORFs), is found to serve as a good coding potential. An iterative algorithm is designed to find a set of "root" sequences using this coding potential. A multivariate entropy distance (MED) algorithm is then proposed for the identification of prokaryotic genes; it combines the use of a coding potential with an EDP-based sequence similarity analysis. The current version of MED is unsupervised, parameter-free and simple to implement. When tested against the NCBI RefSeq database, it detects 95-99% of genes while predicting 10-30% additional genes, and it detects 97.5-99.8% of confirmed genes with known functions. It is also shown to find a set of (functionally known) genes that are missed by other well-known gene-finding algorithms. All measurements show that the MED algorithm reaches a performance level similar to that of algorithms such as GeneMark and Glimmer for prokaryotic gene prediction.
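
The mapping from a DNA sequence to a 20-dimensional vector and the distance-ratio coding potential can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so the sketch makes several assumptions: the EDP component for amino acid i is taken to be -p_i log p_i divided by the total Shannon entropy of the amino acid frequencies, distances are Euclidean, and stop codons are simply skipped. The function names (`translate`, `edp`, `coding_potential`) and the two "center" vectors are hypothetical and are not taken from the MED implementation.

```python
import math
from collections import Counter

# Standard (universal) genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 "fundamental words"

def translate(dna):
    """Translate DNA (frame 0) to a pseudo-amino acid string.
    Stop codons and codons with unknown bases are skipped (assumption)."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3])
        if aa and aa != "*":
            protein.append(aa)
    return "".join(protein)

def edp(dna):
    """Assumed EDP: s_i = -p_i log(p_i) / H, with H the total Shannon entropy."""
    counts = Counter(translate(dna))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no translatable codons")
    terms = []
    for aa in AMINO_ACIDS:
        p = counts[aa] / total
        terms.append(-p * math.log(p) if p > 0 else 0.0)
    entropy = sum(terms)
    if entropy == 0:  # only one amino acid present: every term is zero
        return terms
    return [t / entropy for t in terms]

def distance(u, v):
    """Euclidean distance between two 20-dimensional vectors (assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def coding_potential(dna, coding_center, noncoding_center):
    """Ratio of distances to the averaged coding and non-coding EDPs.
    Values below 1 mean the sequence lies closer to the coding center."""
    s = edp(dna)
    d_noncoding = distance(s, noncoding_center)
    if d_noncoding == 0:
        return math.inf
    return distance(s, coding_center) / d_noncoding
```

In this sketch the two centers would be obtained by averaging `edp` over sets of known coding and non-coding ORFs; how MED selects and iteratively refines those "root" sets is not specified in the abstract and is not modeled here.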
