Scalable inference of discrete data: User behavior, networks and genetic variation.

Abstract

Recent years have seen explosive growth in data, models and computation. Massive data sets and sophisticated probabilistic models are increasingly used in high-energy physics, biology, genetics and personalization applications; however, many statistical algorithms remain inefficient, impeding scientific progress.

In this thesis, we present several efficient statistical algorithms for learning from massive discrete data sets. We focus on discrete data because complex and structured activity such as chromosome folding in three dimensions, human genetic variation, social network interactions and product ratings is often encoded as simple matrices of discrete numerical observations. Our algorithms derive from a Bayesian perspective and lie in the framework of directed graphical models and mean-field variational inference. Situated in this framework, we gain computational and statistical efficiency through modeling insights and through subsampling informative data during inference.

We begin with additive Poisson factorization models for recommending items to users based on user consumption or ratings. These models provide sparse latent representations of users and items, and capture the long-tailed distributions of user consumption. We use them as building blocks for article recommendation models by sharing latent spaces across readership and article text. We demonstrate that our algorithms scale to massive data sets, are easy to implement and provide competitive user recommendations. Then, we develop a Bayesian nonparametric model in which the latent representations of users and items grow to accommodate new data.

In the second part of the thesis, we develop novel algorithms for discovering overlapping communities in large networks. These algorithms interleave non-uniform subsampling of the network with model estimation. Our network models capture the basic ways in which nodes connect to each other, through similarity and popularity, using mixed-membership representations and a generalized linear model formulation.

Finally, we present the TeraStructure algorithm to fit Bayesian models of genetic variation in human populations on tera-sample-sized data sets (10^12 observed genotypes, e.g., 1M individuals at 1M SNPs). On real genomic data collected from thousands of individuals, TeraStructure is faster than existing methods and recovers the latent population structure with equal accuracy. On genomic data simulated at the tera-sample-size scale, TeraStructure is highly accurate and is the only method that can complete its analysis.


Bibliographic record

Author: Gopalan, Prem K.
Author affiliation: Princeton University
Degree-granting institution: Princeton University
Discipline: Computer Science
Degree: Ph.D.
Year: 2015
Pages: 157
Original format: PDF
Language: English

