Pattern Discovery in DNA Sequences.

Abstract

A pattern is a relatively short sequence that represents a phenomenon in a set of sequences. Not all short sequences are patterns; only those that are statistically significant are referred to as patterns or motifs. Pattern discovery methods analyze sequences and attempt to identify and characterize meaningful patterns. This thesis extends the application of pattern discovery algorithms to a new problem domain: single nucleotide polymorphism (SNP) classification.

SNPs are single base-pair (bp) variations in the genome and are probably the most common form of genetic variation. On average, about one in every thousand bp may be an SNP. The function of most SNPs, especially those not associated with protein sequence changes, remains unclear. However, genome-wide linkage analyses have associated many SNPs with phenotypes ranging from disorders such as Crohn's disease and cancer to quantitative traits such as height or hair color. As a result, many groups are working to predict the functional effects of individual SNPs. In contrast, very little research has examined the causes of SNPs: why do SNPs occur where they do?

This thesis addresses that question by using pattern discovery algorithms to study non-coding DNA sequences. The hypothesis is that short DNA patterns can be used to predict SNPs. For example, such patterns in the sequence surrounding an SNP might block the DNA repair mechanism at that site, thus causing the SNP to occur. To test the hypothesis, a model is developed that predicts SNPs using pattern discovery methods. The results show that SNP prediction with pattern discovery methods is weak (50 ± 2%), whereas machine learning classification algorithms can achieve prediction accuracy as high as 68%. To determine whether the poor performance of pattern discovery is due to data characteristics (such as sequence length or pattern length) or to the specific biological problem (SNP prediction), a survey was conducted by profiling eight representative pattern discovery methods at multiple parameter settings on 6,754 real biological datasets. This is the first systematic review of pattern discovery methods with assessments of prediction accuracy, CPU usage, and memory consumption. It was found that current pattern discovery methods do not consider positional information and do not handle short sequences (less than 150 bp) well, including SNP sequences.

Therefore, this thesis proposes a new supervised pattern discovery and classification algorithm, referred to as Weighted-Position Pattern Discovery and Classification (WPPDC). WPPDC exploits positional information to identify positionally enriched motifs and selects motifs with high information content for further classification. A tree structure is applied to WPPDC (the resulting variant is referred to as T-WPPDC) in order to reduce algorithmic complexity. Compared with existing pattern discovery methods, T-WPPDC not only showed consistently superior prediction accuracy but also generated patterns with positional information. Machine learning classification methods (such as Random Forests) showed comparable prediction accuracy. However, unlike T-WPPDC, they are purely classification methods and cannot generate SNP-associated patterns.
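The abstract does not describe how WPPDC scores motifs, so the following is only a minimal sketch of the two general ideas it names: counting motifs per position, and measuring a motif's information content. It is not the thesis's algorithm. The function names `positional_kmer_counts` and `information_content`, the choice of fixed-length k-mers, and the toy sequences are all assumptions made for illustration.

```python
import math
from collections import Counter


def positional_kmer_counts(seqs, k):
    """Count each k-mer separately at every start position.

    Assumes the sequences are aligned (for example, centred on the SNP site),
    so that the same index means the same relative position in each sequence.
    """
    counts = Counter()
    for seq in seqs:
        for pos in range(len(seq) - k + 1):
            counts[(pos, seq[pos:pos + k])] += 1
    return counts


def information_content(instances):
    """Total information content, in bits, of aligned equal-length DNA motif instances.

    Each column contributes 2 - H bits, where H is the Shannon entropy of the
    observed base frequencies and 2 bits is the maximum for a 4-letter alphabet.
    """
    n = len(instances)
    total = 0.0
    for column in zip(*instances):
        entropy = 0.0
        for count in Counter(column).values():
            p = count / n
            entropy -= p * math.log2(p)
        total += 2.0 - entropy
    return total


# Toy, made-up sequences (hypothetical; not data from the thesis).
seqs = ["ACGTAC", "ACGTTT", "TTGTAC", "ACGAAC"]
counts = positional_kmer_counts(seqs, k=3)

# The (position, k-mer) pair that occurs most often at a fixed position.
best_pos, best_kmer = max(counts, key=counts.get)
```

In this sketch a k-mer that recurs at the same offset across many sequences stands out in `counts`, whereas a position-blind count would pool all offsets together. The information content uses uniform background frequencies and no small-sample correction; a real method would likely need both a background model and a significance test.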

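Bibliographic record

  • Author

    Yan, Rui

  • Author affiliation

    University of Toronto (Canada)

  • Degree-granting institution University of Toronto (Canada)
  • Subject Computer science; Bioinformatics
  • Degree Ph.D.
  • Year 2012
  • Pagination 185 p.
  • Total pages 185
  • Original format PDF
  • Language of text English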
