Scalable Bayesian Nonparametrics and Sparse Learning for Hidden Relationship Discovery


Abstract

Real-world data often contain hidden relationships, such as interactions between modes in multidimensional arrays (tensors), subsets of features correlated with specific responses, and associations between heterogeneous data sources. Uncovering these relationships is a key problem in machine learning and data mining, with applications ranging from information security to imaging genetics and computational advertising.

Mining these relationships raises two significant challenges. First, how can we design models powerful enough to capture the complicated, potentially highly nonlinear patterns in data? Second, how can we develop model estimation algorithms efficient enough to handle real-world data volumes, say, millions of features and billions of tensor elements?

In this dissertation, we address these challenges using Bayesian learning techniques. Compared with other methodologies, Bayesian learning has a unique advantage: it provides a principled, interpretable mathematical framework for data modeling and reasoning under uncertainty. We use two families of Bayesian approaches, Bayesian nonparametrics and sparse learning, to uncover two kinds of fundamental relationships hidden in data: the interactions between multiple entities within tensors, where each mode represents a particular type of entity (e.g., a three-mode (user, movie, music) tensor), and the correlations between features and responses in high-dimensional and multiview data.

Bayesian nonparametrics allow the number of model parameters to grow with the data, so the model automatically adapts to the complexity of the data patterns. Bayesian nonparametric models are therefore well suited to capturing complicated, possibly highly nonlinear interactions. Bayesian sparse learning filters out noise and identifies useful, succinct patterns in data, and is therefore particularly suitable for discovering correlated relationships, which are often sparse.

To address the computational challenges of large-scale applications, we explore various means, such as divide-and-conquer modeling, local computation, variational transformations, and factorized approximations, to obtain decomposable mathematical structures in the learning objective functions. Based on these structures, we develop efficient parallel or online model estimation algorithms that handle real-world large-scale data.

Specifically, we first design Bayesian nonparametric factorization models, based on Gaussian processes and Dirichlet processes, to capture the nonlinear interactions underlying tensor data and to discover hidden clusters within tensor modes. We develop a scalable online inference algorithm for a single machine, as well as highly efficient parallel inference algorithms for Hadoop and Spark clusters. We demonstrate their accuracy gains over traditional methods on tensor completion tasks with billion-entry data.

Second, based on the spike-and-slab prior, we develop an online Bayesian sparse learning algorithm that identifies subsets of features correlated with responses of interest in large-scale, high-dimensional data with millions of samples and features. We demonstrate its significant advantages over competing state-of-the-art approaches in large-scale applications, including text classification and click-through-rate prediction for online advertising. Finally, to capture the cross-correlations between features from heterogeneous data views, we use spike-and-slab priors and Gaussian processes to develop a sparse multiview learning model. We show its successful application to association discovery and diagnosis in data from an Alzheimer's disease study.
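The abstract refers to the spike-and-slab prior without defining it. As a rough illustration only, the sketch below draws weights from a generic textbook spike-and-slab prior: each weight is exactly zero (the spike) unless a Bernoulli indicator selects it, in which case it is drawn from a zero-mean Gaussian (the slab). The function name, parameter names, and parameter values are assumptions made for this example and are not taken from the dissertation, whose exact prior and inference algorithm may differ.

```python
import random


def sample_spike_and_slab(num_features, inclusion_prob, slab_std, rng):
    """Draw one weight vector from a generic spike-and-slab prior.

    Hypothetical helper for illustration: each feature is selected with
    probability inclusion_prob; selected weights come from N(0, slab_std^2),
    unselected weights are exactly 0.0.
    """
    weights, selected = [], []
    for _ in range(num_features):
        keep = rng.random() < inclusion_prob  # Bernoulli selection indicator
        selected.append(keep)
        weights.append(rng.gauss(0.0, slab_std) if keep else 0.0)
    return weights, selected


rng = random.Random(0)  # fixed seed so the draw is reproducible
weights, selected = sample_spike_and_slab(
    num_features=1000, inclusion_prob=0.05, slab_std=1.0, rng=rng
)
print(sum(selected), "of", len(weights), "weights are selected")
```

With a small inclusion probability, most sampled weights are exactly zero, which is the sense in which such a prior encourages sparse feature subsets.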

Bibliographic record

  • Author

    Zhe, Shandian

  • Author affiliation

    Purdue University

  • Degree-granting institution Purdue University
  • Subject Computer science
  • Degree Ph.D.
  • Year 2017
  • Pages 161 p.
  • Total pages 161
  • Original format PDF
  • Language English
