BMC Genomics

Reliability and applications of statistical methods based on oligonucleotide frequencies in bacterial and archaeal genomes


Abstract

Background: The growing number of sequenced prokaryotic genomes holds a wealth of genomic data that needs to be analysed effectively. A set of statistical tools exists for such analysis, but their strengths and weaknesses have not been fully explored. The statistical methods considered here are mainly used to examine similarities between archaeal and bacterial DNA from different genomes. These methods compare the observed genomic frequencies of fixed-size oligonucleotides with expected values, which can be derived from genomic nucleotide content, from the frequencies of shorter oligonucleotides, or from specific statistical distributions. Advantages of these methods include measuring phylogenetic relationships using relatively small pieces of DNA sampled from almost anywhere within a genome, detecting foreign or conserved DNA, and homology searches. Our aim was to explore the reliability and best-suited applications of some popular methods: relative oligonucleotide frequencies (ROF), di- to hexanucleotide zeroth-order Markov methods (ZOM), and the second-order Markov chain method (MCM). Tests were performed on distant homology searches with large DNA sequences, detection of foreign/conserved DNA, and plasmid-host similarity comparisons. Additionally, the reliability of the methods was tested by comparing both real and random genomic DNA.

Results: Our findings show that the optimal method is context dependent. ROFs were best suited for distant homology searches, whilst the hexanucleotide ZOM and MCM measures were more reliable in terms of phylogeny. The dinucleotide ZOM method produced high correlation values when real genomes were compared with an artificially constructed random genome of similar %GC, and should therefore be used with care. The tetranucleotide ZOM measure was a good measure for detecting horizontally transferred regions, and when it was used to compare the phylogenetic relationships between plasmids and hosts, a significant correlation (R² = 0.4) was found with genomic GC content and intra-chromosomal homogeneity.

Conclusion: The statistical methods examined are fast, easy to implement, and powerful for a number of different applications involving genomic sequence comparisons. However, none of the measures examined was superior in all tests, and the choice of statistical method should therefore depend on the task at hand.
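
To make the idea of comparing observed oligonucleotide frequencies with expected values concrete, the following is a minimal Python sketch of a zeroth-order Markov style measure. It is an illustration under stated assumptions, not the authors' implementation: the expected count of each oligonucleotide is taken to be the product of its single-base frequencies, only one strand is counted, and two sequences are compared with a plain Pearson correlation. The function names (`kmer_counts`, `zom_ratios`, `pearson`, `zom_similarity`) are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

BASES = "ACGT"

def kmer_counts(seq, k):
    """Count overlapping k-mers, skipping windows with non-ACGT symbols."""
    seq = seq.upper()
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(w for w in windows if set(w) <= set(BASES))

def zom_ratios(seq, k=4):
    """Observed/expected ratio per k-mer (assumed zeroth-order model).

    Expected count = (number of k-mers) * product of base frequencies.
    """
    seq = seq.upper()
    base_counts = Counter(c for c in seq if c in BASES)
    total = sum(base_counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    p = {b: base_counts[b] / total for b in BASES}
    counts = kmer_counts(seq, k)
    n = sum(counts.values())
    ratios = {}
    for kmer in map("".join, product(BASES, repeat=k)):
        expected = float(n)
        for b in kmer:
            expected *= p[b]
        # A k-mer that cannot occur under the model gets ratio 0.0.
        ratios[kmer] = counts[kmer] / expected if expected > 0 else 0.0
    return ratios

def pearson(xs, ys):
    """Pearson correlation; returns NaN if either vector is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)

def zom_similarity(seq_a, seq_b, k=4):
    """Correlate the observed/expected profiles of two sequences."""
    ra, rb = zom_ratios(seq_a, k), zom_ratios(seq_b, k)
    kmers = sorted(ra)
    return pearson([ra[m] for m in kmers], [rb[m] for m in kmers])
```

A call such as `zom_similarity(genome_a, genome_b, k=4)` would give a tetranucleotide comparison, and `k=2` the dinucleotide variant that the abstract advises using with care. Normalising by base composition is what separates this sketch from raw relative frequencies; a Markov chain variant would instead derive the expected counts from shorter oligonucleotides.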
