Algorithms for large scale DNA copy number data.

Abstract

High-throughput array-based assays have recently been developed to detect DNA copy number (DCN) aberrations. Identifying DCN aberrations is important for finding tumor suppressor genes and oncogenes. However, the DCN data from these arrays are characterized by high noise levels and unequal spacing of the probes along the genome.

Two main types of methods have been proposed for analyzing DCN data. Denoising and smoothing approaches try to reduce the noise in the data. Segmentation approaches try to identify the chromosomal segments that carry copy number aberrations.

A novel stationary wavelet denoising scheme based on interpolation is then developed for DCN data. Empirical results on synthetic data show that the method outperforms the best previously proposed wavelet denoising method by 4.6% to 12.7% in root mean squared error. Experiments on a real data set confirm its applicability to real DCN data.

After that, a novel model-based method for DCN data segmentation is developed using the minimum description length (MDL) principle. The goal of computational analysis of array-based DCN data is to infer the underlying DCNs from raw DCN data. Tumor samples, however, are often contaminated by normal cells, and previous methods for this task do not model the tumor/normal cell mixture ratio explicitly. The new method outputs the underlying DCN for each chromosomal segment and, at the same time, infers the tumor proportion in the test samples. Empirical results show that it achieves a 40% to 60% decrease in misclassification rate on average compared with two previous methods, Circular Binary Segmentation (CBS) and the Hidden Markov Model (HMM).

An HMM is a good model for parsing noisy data with hidden states. It has already been very successful in speech recognition and shape identification, so it has high potential for processing DCN data with high noise levels. We propose a Gaussian mixture hidden Markov model (GMHMM) method that classifies noisy DCN data into three states: loss, normal, and gain. On both synthetic and real data, the GMHMM is shown to classify more accurately than CBS, a previous HMM, and ultrasome.

After the preliminary analysis of DCN data, the next step is data mining, for example classifying cancers according to features of the data, or applying the Gene Ontology to the data to retrieve its biological meaning. For cancer classification on DCN data, we introduce a method that optimizes interval thresholds. The underlying function in aCGH DNA copy number data is a piecewise constant square-wave function, so a smaller number of features than the full set of probes can represent the data. In this way, we avoid the "curse of dimensionality" problem. Using intervals as features yields better classification accuracies and P-values than using probes as features.

The Gene Ontology is a controlled vocabulary created to describe gene functions. Many web tools, such as EASE and GoMiner, use Fisher's exact test to find the biological interpretation of a gene list of interest in the context of the Gene Ontology. They require the user to select a list of significantly dysregulated genes from the whole list of genes on a microarray. This gene selection step can be difficult because P-value estimation may be inaccurate after multiple testing correction. After applying t-tests to the whole gene set on a microarray and ranking the genes by P-value, we developed a novel method to combine P-values, eliminating the need for a gene selection step. We obtained better results than with EASE, as measured by comparing receiver operating characteristic (ROC) curves.
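
The abstract does not state how the tumor/normal mixture enters the model. As a rough illustration only, the sketch below uses a commonly assumed linear mixing relation: a sample with tumor fraction p and tumor copy number c has an average copy number of p·c + (1 − p)·2, which is compared against a diploid reference. The function names and the diploid reference value are assumptions made for this example and are not taken from the thesis.

```python
import math

NORMAL_COPIES = 2  # assumption: contaminating normal cells are diploid


def expected_log2_ratio(tumor_copies: float, tumor_fraction: float) -> float:
    """Expected log2 ratio of a segment under linear tumor/normal mixing."""
    if not 0.0 <= tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must lie in [0, 1]")
    mixed = tumor_fraction * tumor_copies + (1.0 - tumor_fraction) * NORMAL_COPIES
    if mixed <= 0.0:
        return float("-inf")  # homozygous deletion in a pure tumor sample
    return math.log2(mixed / NORMAL_COPIES)


def tumor_copies_from_ratio(log2_ratio: float, tumor_fraction: float) -> float:
    """Invert the mixing relation to recover the tumor copy number."""
    if not 0.0 < tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must lie in (0, 1]")
    mixed = NORMAL_COPIES * 2.0 ** log2_ratio
    return (mixed - (1.0 - tumor_fraction) * NORMAL_COPIES) / tumor_fraction
```

Under this relation, a single-copy loss in a sample that is half normal cells gives a log2 ratio near -0.415 rather than -1, which suggests why ignoring the mixture ratio can lead to misclassified segments.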
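
To make the three-state idea (loss, normal, gain) concrete, here is a minimal Viterbi decoding sketch for a hidden Markov model with Gaussian emissions. It is not the GMHMM from the thesis: it uses a single Gaussian per state instead of a mixture, fixed parameters instead of learned ones, and a symmetric transition matrix. The state means, standard deviation, and `stay_prob` are hypothetical values chosen for illustration.

```python
import math

STATES = ("loss", "normal", "gain")


def gaussian_logpdf(x: float, mean: float, sd: float) -> float:
    """Log density of a normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (x - mean) ** 2 / (2.0 * sd * sd)


def viterbi(observations, means=(-0.6, 0.0, 0.6), sd=0.15, stay_prob=0.9):
    """Return the most likely state label for each log2-ratio observation."""
    if not observations:
        return []
    n = len(STATES)
    switch_prob = (1.0 - stay_prob) / (n - 1)
    log_trans = [
        [math.log(stay_prob if i == j else switch_prob) for j in range(n)]
        for i in range(n)
    ]
    # Uniform initial distribution over the three states.
    score = [
        math.log(1.0 / n) + gaussian_logpdf(observations[0], means[s], sd)
        for s in range(n)
    ]
    backpointers = []
    for x in observations[1:]:
        pointers, new_score = [], []
        for j in range(n):
            # Best predecessor state for landing in state j at this probe.
            best_i = max(range(n), key=lambda i: score[i] + log_trans[i][j])
            pointers.append(best_i)
            new_score.append(
                score[best_i] + log_trans[best_i][j] + gaussian_logpdf(x, means[j], sd)
            )
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [STATES[s] for s in path]
```

Calling `viterbi` on a list of per-probe log2 ratios returns one label per probe. The high `stay_prob` discourages state changes driven by a single noisy probe, which is the property that makes HMMs attractive for noisy DCN data.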

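Bibliographic record

  • Author

    Wang, Siling

  • Author affiliation

    Southern Methodist University

  • Degree-granting institution Southern Methodist University
  • Subject Biology Bioinformatics.
  • Degree Ph.D.
  • Year 2012
  • Pages 129 p.
  • Total pages 129
  • Original format PDF
  • Language of text English
  • CLC classification
  • Keywords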
