Conference: International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics

A Quantitative and Qualitative Characterization of k-mer Based Alignment-Free Phylogeny Construction

Abstract

The rapidly growing volume of genomic data, including pathogen genomes, both invites exploration of possible phylogenetic relationships among unclassified organisms and challenges standard techniques that require multiple sequence alignment. Further, the ability to probe variations in selection pressure, e.g. among viral outbreaks, is important for characterizing the life of a virus in its biological reservoir. In this paper, we derived the probability distribution of k-mer alignment lengths between random sequences for a given optimized score, in order to quantify the probability that a given alignment was no better than chance, and applied it to human papillomavirus (HPV), primate mtDNA, and Ebola. Even for highly variable HPV types, the number of k-mers required to significantly distinguish an alignment of related genomes from random sequences fell from 64 for 1-mers to 6 for 3-mers and 4 for 4-mers. This indicates that k-mers provide sufficient specificity to characterize differences between sequences by their k-mer frequencies, allowing distances based on k-mer frequencies to serve as a proxy for evolutionary distance. We constructed phylogenies from primate mtDNA coding sequences and from Ebola samples. UPGMA phylogenies built from k-mers of the primate mtDNA coding region reproduced most of the expected primate phylogeny. The Mantel test, applied to RAxML and Bayesian phylogenetic distances between Ebola samples versus 3-mer frequency distances, was highly significant (≤ 1 × 10^(-5)). We characterized differences in selection pressure between coding and non-coding regions, and between early and late cell cycle genes in Ebola. Coding versus non-coding regions showed evidence of purifying selection, while early versus late cell cycle proteins differed, with late cycle proteins resembling an influenza-like immunological response; notably, the g-proteins are among the late genes.
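
To make the idea of a k-mer frequency distance concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not state which distance metric was used, so the Euclidean distance here is an assumption, and the function names kmer_frequencies and kmer_distance are hypothetical.

```python
from collections import Counter
from itertools import product
import math


def kmer_frequencies(seq, k):
    """Relative frequencies of all 4**k DNA k-mers in seq.

    Uses overlapping windows; windows containing symbols other than
    A, C, G, T (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed lexicographic order so vectors from different sequences line up.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    return [counts[m] / total for m in kmers]


def kmer_distance(a, b, k=3):
    """Euclidean distance between k-mer frequency vectors.

    Assumption: the metric is chosen for illustration only; the
    abstract does not specify the one used in the paper.
    """
    fa = kmer_frequencies(a, k)
    fb = kmer_frequencies(b, k)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))
```

A pairwise matrix of such distances over a set of genomes could then be passed to a standard clustering routine such as UPGMA, or compared with tree-derived distances using a Mantel test, as the abstract describes.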
