Scalable inference of discrete data: User behavior, networks and genetic variation.


Abstract

Recent years have seen explosive growth in data, models and computation. Massive data sets and sophisticated probabilistic models are increasingly used in high-energy physics, biology, genetics and personalization applications; however, many statistical algorithms remain inefficient, impeding scientific progress.

In this thesis, we present several efficient statistical algorithms for learning from massive discrete data sets. We focus on discrete data because complex and structured activity such as chromosome folding in three dimensions, human genetic variation, social network interactions and product ratings is often encoded as simple matrices of discrete numerical observations. Our algorithms derive from a Bayesian perspective and lie in the framework of directed graphical models and mean-field variational inference. Situated in this framework, we gain computational and statistical efficiency through modeling insights and through subsampling informative data during inference.

We begin with additive Poisson factorization models for recommending items to users based on user consumption or ratings. These models provide sparse latent representations of users and items, and capture the long-tailed distributions of user consumption. We use them as building blocks for article recommendation models by sharing latent spaces across readership and article text. We demonstrate that our algorithms scale to massive data sets, are easy to implement and provide competitive user recommendations. Then, we develop a Bayesian nonparametric model in which the latent representations of users and items grow to accommodate new data.

In the second part of the thesis, we develop novel algorithms for discovering overlapping communities in large networks. These algorithms interleave non-uniform subsampling of the network with model estimation. Our network models capture the basic ways in which nodes connect to each other, through similarity and popularity, using mixed-membership representations and a generalized linear model formulation.

Finally, we present the TeraStructure algorithm to fit Bayesian models of genetic variation in human populations on tera-sample-sized data sets (10^12 observed genotypes, e.g., 1M individuals at 1M SNPs). On real genomic data collected from thousands of individuals, TeraStructure is faster than existing methods and recovers the latent population structure with equal accuracy. On genomic data simulated at the tera-sample-size scale, TeraStructure is highly accurate and is the only method that can complete its analysis.

Bibliographic record

  • Author

    Gopalan, Prem K.

  • Author affiliation

    Princeton University

  • Degree-granting institution Princeton University
  • Discipline Computer Science
  • Degree Ph.D.
  • Year 2015
  • Pages 157
  • Format PDF
  • Language English
