Journal: Journal of Computer Science & Systems Biology

Non-Parametric Bayesian Modelling of Digital Gene Expression Data

Abstract

Next-generation sequencing technologies provide a revolutionary tool for generating gene expression data. Starting with a fixed RNA sample, they construct a library of millions of differentially abundant short sequence tags or “reads”, which constitute a fundamentally discrete measure of the level of gene expression. A common limitation of experiments using these technologies is the low number, or even absence, of biological replicates, which complicates the statistical analysis of digital gene expression data. Analysis of this type of data has often been based on modified tests originally devised for microarrays; both these and de novo methods for the analysis of RNA-seq data are hampered by the same problem of low replication. We propose a novel, non-parametric Bayesian approach for the analysis of digital gene expression data. We begin with a hierarchical model for over-dispersed count data and a blocked Gibbs sampling algorithm for inferring the posterior distribution of the model parameters conditional on these counts. The algorithm compensates for the low number of biological replicates by clustering together genes whose tag counts are likely sampled from a common distribution and using this augmented sample to estimate the parameters of that distribution. The number of clusters is not fixed a priori but is inferred along with the remaining model parameters. We demonstrate the ability of this approach to model biological data with high fidelity by applying the algorithm to a public dataset obtained from cancerous and non-cancerous neural tissues. Source code implementing the methodology presented in this paper is freely available as the Python package DGEclust at https://bitbucket.org/DimitrisVavoulis/dgeclust.
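
The clustering idea in the abstract can be illustrated with a minimal sketch of one sweep of a blocked Gibbs sampler for a truncated stick-breaking mixture. Several things here are assumptions rather than details taken from the abstract: the negative binomial likelihood (one common choice for over-dispersed counts), the fixed truncation level, the toy counts, and every function and parameter name. This is not the DGEclust API, and the update of the cluster means and dispersions is omitted.

```python
import math
import random


def stick_breaking(alpha, n_clusters, rng):
    """Draw mixture weights from a truncated stick-breaking prior."""
    weights, remaining = [], 1.0
    for k in range(n_clusters):
        # The last stick fraction is fixed to 1 so the weights sum to one.
        v = 1.0 if k == n_clusters - 1 else rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights


def nb_logpmf(y, mean, dispersion):
    """Log-pmf of a negative binomial with variance mean + dispersion * mean**2."""
    r = 1.0 / dispersion
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mean)) + y * math.log(mean / (r + mean)))


def sample_assignments(counts, weights, means, dispersions, rng):
    """Sample a cluster label per gene, given the current cluster parameters."""
    labels = []
    for gene_counts in counts:
        # Unnormalised log-posterior: log weight plus log-likelihood of all replicates.
        log_post = [
            math.log(w) + sum(nb_logpmf(y, m, d) for y in gene_counts)
            for w, m, d in zip(weights, means, dispersions)
        ]
        top = max(log_post)  # subtract the maximum for numerical stability
        probs = [math.exp(lp - top) for lp in log_post]
        labels.append(rng.choices(range(len(weights)), weights=probs)[0])
    return labels


def update_weights(labels, alpha, n_clusters, rng):
    """Resample the stick-breaking weights given the current cluster sizes."""
    sizes = [labels.count(k) for k in range(n_clusters)]
    weights, remaining = [], 1.0
    for k in range(n_clusters):
        tail = sum(sizes[k + 1:])  # genes assigned to later clusters
        if k == n_clusters - 1:
            v = 1.0
        else:
            v = rng.betavariate(1.0 + sizes[k], alpha + tail)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights


# Toy data (made up): three genes with two replicates each, two candidate clusters.
rng = random.Random(0)
counts = [[4, 6], [5200, 4900], [3, 7]]
labels = sample_assignments(counts, [0.5, 0.5], [5.0, 5000.0], [0.1, 0.1], rng)
weights = update_weights(labels, alpha=1.0, n_clusters=2, rng=rng)
print(labels, weights)
```

Because genes that share a label pool their replicates, the parameters of each cluster are estimated from more observations than any single gene provides. Clusters that end up empty carry little weight, so the effective number of clusters is determined by the data rather than set in advance, up to the truncation level.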