New Advancements of Scalable Statistical Methods for Learning Latent Structures in Big Data.

Abstract

Constant technological advances have caused an explosion of data in recent years. Accordingly, modern statistical and machine learning methods must be adapted to deal with complex and heterogeneous data types. This is particularly true for the analysis of biological data. For example, DNA sequence data can be viewed as categorical variables, with each nucleotide taking one of four categories. Gene expression data, depending on the quantitative technology, can be continuous numbers or counts. With the advancement of high-throughput technology, such data have become unprecedentedly abundant. Efficient statistical approaches are therefore crucial in this big data era.

Previous statistical methods for big data often aim to find low-dimensional structures in the observed data. For example, a factor analysis model assumes a latent multivariate Gaussian vector. Under this assumption, a factor model produces a low-rank estimate of the covariance of the observed variables. Another example is the latent Dirichlet allocation model for documents, which assumes that the mixture proportions of topics are represented by a Dirichlet-distributed variable. This dissertation proposes several novel extensions to these earlier statistical methods, developed to address challenges in big data. The new methods are applied in multiple real-world applications, including the construction of condition-specific gene co-expression networks, the estimation of shared topics among newsgroups, the analysis of promoter sequences, the analysis of political-economics risk data, and the estimation of population structure from genotype data.
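The factor-analysis remark in the abstract can be illustrated with a small sketch. It is not taken from the dissertation. It assumes the textbook factor model x = Λz + ε with z ~ N(0, I) and independent noise ε, whose implied covariance is ΛΛᵀ plus a diagonal noise term. The function name `factor_covariance`, the dimensions, and the noise variances are all hypothetical choices made for illustration.

```python
import numpy as np


def factor_covariance(loadings, noise_var):
    """Covariance implied by x = loadings @ z + eps,
    with z ~ N(0, I) and eps ~ N(0, diag(noise_var))."""
    loadings = np.asarray(loadings, dtype=float)
    noise_var = np.asarray(noise_var, dtype=float)
    return loadings @ loadings.T + np.diag(noise_var)


# Hypothetical sizes: 6 observed variables, 2 latent factors.
rng = np.random.default_rng(0)
p, k = 6, 2
loadings = rng.normal(size=(p, k))
noise_var = np.full(p, 0.5)

cov = factor_covariance(loadings, noise_var)

# The structured part of the covariance has rank at most k.
low_rank_part = loadings @ loadings.T
rank = np.linalg.matrix_rank(low_rank_part)
```

In this sketch the p × p covariance is described by p·k loadings plus p noise variances rather than p(p + 1)/2 free entries, which is the sense in which a factor model yields a low-dimensional description of the observed variables.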

Bibliographic record

  • Author

    Zhao, Shiwen

  • Author affiliation

    Duke University

  • Degree-granting institution Duke University
  • Subject Statistics; Mathematics; Bioinformatics
  • Degree Ph.D.
  • Year 2016
  • Pages 203 p.
  • Total pages 203
  • Original format PDF
  • Language English