
Computational analysis of thermal denaturation differences and prediction of coding and non-coding eukaryotic DNA sequences.



Abstract

In recent years, the exponential growth of genetic sequence data has offered an unprecedented opportunity for a new understanding of biology, as well as many great challenges in using that data. Computational analysis has become essential in the investigation of DNA sequences, from their biophysical properties to the information they encode.

In Chapter Two of this dissertation, I used computational modeling and statistical analysis to investigate the thermal denaturation (melting) of eukaryotic DNA sequences, specifically the relationship between the melting temperature (Tm) and the base and sequence content of different regions of the sequences. Using the program MELTSIM, which simulates DNA melting based on a nearest-neighbor thermodynamic model, I demonstrated that the Tm vs. FGC (mole fraction of the bases G and C) relationships in coding and non-coding DNAs are both linear but differ in slope by a statistically significant 6.6%. By comparing these results with simulation results for various base shufflings of the original DNAs, and with the average nearest-neighbor frequencies of the natural sequences across the FGC range, I showed that these differences in the Tm vs. FGC relationships are a direct result of systematic FGC-dependent biases in nearest-neighbor frequencies in the coding and non-coding DNA classes. The same differences in the Tm vs. FGC relationships and biases in nearest-neighbor frequencies also appear, at smaller magnitudes, between DNA sequences from multicellular and unicellular organisms within the same coding or non-coding class.

Chapter Three of this dissertation explores whether biases in the oligonucleotide frequencies of DNA regions, measured in a biologically relevant 3-base repeating frame along the DNA, can serve as inputs to neural networks (NNs) and support vector machines (SVMs) to predict the coding or non-coding class of any DNA sequence. Using three public standard sequence datasets composed of coding and non-coding DNA sequences, I tested coding versus non-coding classification using mono-, di-, and tri-nucleotide frequencies calculated in the 3-base repeating frame and represented as matrix elements (3 x 4, 3 x 16, and 3 x 64 matrices). These frequencies were calculated by three different functions for three different sequence lengths (54, 108, and 162 base pairs). Overall, prediction accuracy increases as the sequence length and the size of the 3-base frame frequency matrix increase. For both methods, the highest total percentages of correct predictions across the sequence-length conditions are relatively high, from about 77% to 98%. The NN method gave relatively high sensitivity, from 66% to 95%, but lower specificity, from 35% to 66%, on the three sequence datasets. Based on results from the one dataset tested, the SVM method showed a significant improvement in prediction accuracy over the NN method (a 10% to 25% improvement in the correlation coefficient). The implications of the 3-base frame-dependent oligonucleotide frequencies as coding measures, and the application of NNs and SVMs to the coding versus non-coding prediction problem, are discussed.
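
To make the 3 x 4, 3 x 16, and 3 x 64 matrices mentioned in the abstract concrete, here is a minimal sketch of how frame-dependent k-mer frequencies could be tabulated. It is not the dissertation's code: the function name `frame_frequencies` is hypothetical, and the per-row normalization is only one plausible choice, since the abstract does not specify the three frequency functions that were actually used.

```python
from itertools import product


def frame_frequencies(seq, k):
    """Return a 3 x 4**k matrix of k-mer frequencies.

    Row r holds the frequencies of k-mers that start at positions
    r, r + 3, r + 6, ... of the sequence (one row per frame position).
    Assumption: each row is normalized to sum to 1.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: j for j, kmer in enumerate(kmers)}

    counts = [[0] * len(kmers) for _ in range(3)]
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip windows containing ambiguous bases such as N
            counts[i % 3][j] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# k = 1, 2, 3 give the 3 x 4, 3 x 16, and 3 x 64 shapes, respectively.
window = "ATGGCCAAGTTC" * 9  # an arbitrary 108-base example window
for k in (1, 2, 3):
    m = frame_frequencies(window, k)
    print(k, len(m), len(m[0]))
```

Flattening each matrix row by row would yield a fixed-length feature vector (12, 48, or 192 values) that could be passed to a classifier; how the dissertation actually fed these matrices to the NNs and SVMs is not stated in the abstract.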

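Bibliographic record

  • Author

    Long, Dang Duc

  • Author affiliation

    University of Massachusetts Lowell

  • Degree-granting institution: University of Massachusetts Lowell
  • Discipline: Biology, Molecular; Chemistry, Biochemistry
  • Degree: Ph.D.
  • Year: 2003
  • Pages: 127 p.
  • Total pages: 127
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: Molecular genetics; Biochemistry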
