Home > U.S. National Institutes of Health literature > Elsevier Public Health Emergency Collection > A novel clustering method via nucleotide-based Fourier power spectrum analysis

A novel clustering method via nucleotide-based Fourier power spectrum analysis



Abstract

A novel clustering method is proposed to classify genes or genomes. The method represents genomic data naturally as binary indicator sequences, one for each nucleotide (adenine (A), cytosine (C), guanine (G), and thymine (T)). The discrete Fourier transform is then applied to these indicator sequences to calculate the power spectra of the nucleotides. Mathematical moments are calculated for each of these spectra to create a multidimensional vector in a Euclidean space for each gene or genome sequence, so that each sequence is realized as a geometric point in that space. Finally, pairwise Euclidean distances between these points (i.e., genome sequences) are calculated to cluster the gene or genome sequences. The method is applied to three data sets: the genomes of 34 coronavirus strains, 118 of the known strains of human rhinovirus (HRV), and 30 bacterial genomes. A distance matrix is computed for each data set, giving the distance from each point to every other point. We used the complete linkage clustering algorithm to build phylogenetic trees, which indicate how the distances among the sequences correspond to the evolutionary relationships among them. This genome representation provides a powerful and efficient way to classify genomes and is much faster than the widely accepted multiple sequence alignment method.
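The pipeline described in the abstract (indicator sequences, discrete Fourier transform, spectral moments, Euclidean distances, complete linkage) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation. The abstract does not give the moment normalization or the number of moments, so the choices here (dropping the k = 0 term, taking the mean of (PS(k)/N)^j, and using three moments) are assumptions, and all function names are hypothetical.

```python
import cmath
import math

NUCLEOTIDES = "ACGT"

def indicator_sequence(seq, nucleotide):
    """Binary indicator sequence: 1 where seq has the nucleotide, else 0."""
    return [1 if base == nucleotide else 0 for base in seq.upper()]

def power_spectrum(u):
    """Power spectrum |U(k)|^2 of the DFT of u, computed naively in O(N^2)."""
    n = len(u)
    spectrum = []
    for k in range(n):
        coeff = sum(u[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        spectrum.append(abs(coeff) ** 2)
    return spectrum

def moment_vector(seq, n_moments=3):
    """Concatenate the first n_moments moments of each nucleotide's spectrum.

    Assumed normalization (not from the source): the k = 0 term is dropped
    and the j-th moment is the mean of (PS(k) / N)^j over the remaining k.
    """
    n = len(seq)
    vector = []
    for nucleotide in NUCLEOTIDES:
        ps = power_spectrum(indicator_sequence(seq, nucleotide))[1:]
        for j in range(1, n_moments + 1):
            vector.append(sum((p / n) ** j for p in ps) / len(ps))
    return vector

def distance_matrix(seqs, n_moments=3):
    """Pairwise Euclidean distances between the moment vectors."""
    points = [moment_vector(s, n_moments) for s in seqs]
    return [[math.dist(p, q) for q in points] for p in points]

def complete_linkage(dist):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return merges

# Toy sequences (made up for illustration, not data from the paper).
toy = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTGGGGG"]
for cluster_a, cluster_b, d in complete_linkage(distance_matrix(toy)):
    print(cluster_a, cluster_b, round(d, 4))
```

Because every sequence maps to a fixed-length vector regardless of its length, no alignment is needed before distances are computed. For real genomes the naive O(N^2) transform above would be too slow; a fast Fourier transform such as `numpy.fft.fft`, together with `scipy.cluster.hierarchy.linkage` using `method="complete"`, would be a more practical choice.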
