A novel probabilistic framework for microarray data analysis: From fundamental probability models to experimental validation.

Abstract

Gene expression studies as currently done with printed arrays or prefabricated gene chips generate large quantities of data that are used to determine the relative expression levels of thousands of genes in a cell. The most compelling characteristic of these data sets is that the number of genes whose expression profiles are to be determined exceeds the number of replicates by several orders of magnitude. Standard spot-by-spot analysis seeks to extract useful information for each gene on the basis of the number of replicates available for the specific gene in question. As has become increasingly clear, this plays to the weakness rather than the strength of microarrays. On the other hand, by virtue of the sheer data volume alone, treating the entire data set as an ensemble and developing fundamental theoretical distributions to represent these ensembles provides a framework for efficient extraction of gene expression information that plays to the strength of microarrays. Relatively little attention has been paid to studying distributions of complete microarray data sets, and virtually all of the published studies are empirical approximations fitted to observed data.

The primary objective of this dissertation is to present fundamental probability models for microarray data distributions that can be used for drawing rigorous statistical inference regarding differential gene expression. In this regard, we have departed from the standard gene-by-gene techniques that rely on ad hoc transformations needed to justify the use of classical statistics, since such techniques play to the weakness of microarray technology. Instead, we consider the entire microarray data set as an ensemble and characterize it as such from first principles.

First, we present theoretical results that confirm what had previously been speculated, or assumed for convenience: that under very reasonable assumptions, the distribution of microarray intensities should follow the Gamma (not lognormal) distribution. It is subsequently established that a polar coordinate transformation of the raw intensity data provides the basis for a technique in which each microarray data set is represented as a mixture of Beta densities for the fractional intensities (not intensity ratios), from which rigorous statistical inference may be drawn regarding differential gene expression. Using a Beta mixture model as its theoretical basis, a probabilistic framework for carrying out statistical inference was then developed. The final outcome of the inference is an ordered triplet of results for each gene: (i) a comparative fractional expression level, (ii) an associated probability that this number indicates lower, higher, or no differential expression, and (iii) a measure of confidence associated with the stated result (determined from the variability estimated from replicates, or else by propagation-of-error techniques when there are no replicates).

The application of the probabilistic framework is illustrated via a detailed treatment of experimental data from gene expression studies in Deinococcus radiodurans following DNA damage; the technique was also successfully tested on well-known datasets that have been studied thoroughly in the bioinformatics literature using different statistical techniques.
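The fractional-intensity result underlying the Beta mixture representation can be made concrete with a small numerical check. The sketch below is illustrative only (it is not code from the dissertation): it assumes two-channel spot intensities drawn from Gamma distributions that share a common scale parameter, in which case the fractional intensity x / (x + y) follows a Beta distribution whose shape parameters are the two Gamma shape parameters. The parameter values and sample size here are hypothetical.

```python
# Illustrative check (not from the dissertation): if two channel intensities
# follow Gamma distributions with a common scale, their fractional intensity
# x / (x + y) follows a Beta distribution with the two Gamma shape parameters.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

a, b, scale = 2.5, 4.0, 100.0                       # hypothetical shapes, common scale
x = rng.gamma(shape=a, scale=scale, size=50_000)    # channel 1 (e.g. treated)
y = rng.gamma(shape=b, scale=scale, size=50_000)    # channel 2 (e.g. control)

frac = x / (x + y)                                  # fractional intensity in (0, 1)

# Compare the empirical fractional intensities with the Beta(a, b) prediction.
ks_stat, p_value = stats.kstest(frac, stats.beta(a, b).cdf)
print(f"KS statistic = {ks_stat:.4f}, p-value = {p_value:.3f}")
```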
Additionally, the probabilistic framework was validated experimentally. The basic steps involved in the validation study were: (i) the analysis of Affymetrix GeneChip array data and the selection of candidate genes based on high probabilities of expression status and the confidence associated with those probabilities, and (ii) the independent characterization of the true expression status (up-regulated, down-regulated, or not differentially expressed) of the selected genes using a complementary high-precision, though not high-throughput, polony technology. The results of the probabilistic framework inference showed good agreement with the confirmatory results from the high-precision polony technology.
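As a rough illustration of how a Beta mixture over fractional intensities could be turned into per-gene inference, the sketch below fits a two-component Beta mixture with an EM-style loop (weighted moment matching stands in for full maximum likelihood in the M-step) and reports, for each gene, its fractional expression level together with the posterior probability of belonging to a "differentially expressed" component. The component structure, starting values, and data are hypothetical simplifications; the dissertation's actual framework is richer (it distinguishes lower, higher, and no differential expression, and attaches a confidence measure from replicates or propagation of error).

```python
import numpy as np
from scipy import stats

def fit_beta_mixture(frac, n_iter=200):
    """EM-style fit of a two-component Beta mixture to fractional intensities.

    The M-step uses weighted moment matching rather than full maximum
    likelihood; this is an illustrative simplification, not the method
    described in the dissertation.
    """
    # Hypothetical starting values: a "no change" component concentrated
    # near 0.5 and a broader component for differentially expressed genes.
    params = [(20.0, 20.0), (2.0, 2.0)]
    weights = np.array([0.8, 0.2])

    for _ in range(n_iter):
        # E-step: responsibility of each component for each gene.
        dens = np.column_stack(
            [w * stats.beta(a, b).pdf(frac) for w, (a, b) in zip(weights, params)]
        )
        resp = dens / dens.sum(axis=1, keepdims=True)

        # M-step: weighted moment matching for each Beta component.
        new_params = []
        for k in range(2):
            r = resp[:, k]
            m = np.average(frac, weights=r)
            v = np.average((frac - m) ** 2, weights=r)
            common = m * (1.0 - m) / v - 1.0
            new_params.append((m * common, (1.0 - m) * common))
        params = new_params
        weights = resp.mean(axis=0)

    return params, weights, resp

# Hypothetical fractional intensities: most genes near 0.5, a minority shifted.
rng = np.random.default_rng(1)
frac = np.concatenate([rng.beta(20, 20, 900), rng.beta(2, 6, 100)])

params, weights, resp = fit_beta_mixture(frac)
for g in range(3):
    print(f"gene {g}: fraction = {frac[g]:.3f}, "
          f"P(differentially expressed) = {resp[g, 1]:.3f}")
```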