
Submodular Optimization and Data Processing.

Abstract

Data sets are large and getting larger. Two common paradigms, data summarization and data partitioning, are often used to handle big data. Data summarization aims at identifying a small subset of the data that attains the maximum utility or information, while the goal of data partitioning is to split the data across multiple compute nodes so that the data block residing on each node becomes manageable. In this dissertation, we investigate how to apply submodularity to these two data processing paradigms.

In the first part of this thesis, we study the connection of submodularity to the data summarization paradigm. First we show that data summarization subsumes a number of applications, including acoustic data subset selection for training speech recognizers [Wei et al., 2014], genomics assay panel selection [Wei et al., 2016], batch active learning [Wei et al., 2015], image summarization [Tschiatschek et al., 2014], document summarization [Lin and Bilmes, 2012], and feature subset selection [Liu et al., 2013], among others. Among these tasks, we perform case studies on the first three applications. We show how to apply appropriate submodular set functions to model the utility for these tasks, and formulate the corresponding data summarization task as a constrained submodular maximization, which admits an efficient greedy heuristic for optimization [Nemhauser et al., 1978]. To better model the utility function for an underlying data summarization task, we also propose a novel "interactive" setting for learning mixtures of submodular functions. For this interactive learning setting, we propose an algorithmic framework and show that it is effective for both the acoustic data selection and the image summarization tasks. While the simple greedy heuristic already solves the constrained submodular maximization efficiently and near-optimally, data summarization tasks may still be computationally challenging in large-scale scenarios. To this end, we introduce a novel multistage algorithmic framework called MultGreed, which significantly scales the greedy algorithm to even larger problem instances. We theoretically show that MultGreed performs very closely to the greedy algorithm, and also empirically demonstrate the significant speedup of MultGreed over the standard greedy algorithm on a number of real-world data summarization tasks.

In the second part of this thesis, we connect submodularity to data partitioning. We first propose two novel submodular data partitioning problems that we collectively call Submodular Partitioning. To solve submodular partitioning, we propose several novel algorithmic frameworks (including greedy, majorization-minimization, minorization-maximization, and relaxation algorithms) that not only scale to large datasets but also achieve theoretical approximation guarantees comparable to the state of the art. We show that submodular partitioning subsumes a number of machine learning applications, including load balancing for parallel systems, intelligent data partitioning for parallel training of statistical models, and unsupervised image segmentation. We perform a case study on the last application, demonstrating an appropriate choice of submodular utility model and the corresponding submodular partitioning formulation. Empirical evidence suggests that the submodular partitioning framework is effective for the intelligent data partitioning task.
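The constrained submodular maximization at the core of the summarization part is typically solved with the greedy heuristic of Nemhauser et al. [1978], which enjoys a (1 - 1/e) approximation guarantee for monotone submodular functions under a cardinality constraint. The sketch below illustrates that standard heuristic on a toy coverage utility; the coverage_utility function and the item sets are hypothetical stand-ins for illustration only, not the utility models or the MultGreed algorithm developed in the dissertation.

def greedy_maximize(f, ground_set, k):
    """Greedily add the element with the largest marginal gain until |S| = k.
    For monotone submodular f this gives a (1 - 1/e) approximation
    [Nemhauser et al., 1978]."""
    selected = set()
    while len(selected) < k:
        current = f(selected)
        best_elem, best_gain = None, 0.0
        for e in ground_set - selected:
            gain = f(selected | {e}) - current
            if gain > best_gain:
                best_elem, best_gain = e, gain
        if best_elem is None:   # no remaining element improves the objective
            break
        selected.add(best_elem)
    return selected

# Hypothetical toy utility: how many distinct "features" a summary covers.
items = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6},
    "d": {1, 6},
}

def coverage_utility(subset):
    covered = set()
    for name in subset:
        covered |= items[name]
    return len(covered)

print(greedy_maximize(coverage_utility, set(items), k=2))   # {'a', 'c'}

The multistage MultGreed framework described in the abstract builds on this baseline to scale it further; only the plain greedy step is sketched here.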

Record details

  • Author

    Wei, Kai.

  • Author affiliation

    University of Washington.

  • Degree-granting institution: University of Washington.
  • Subjects: Electrical engineering; Bioinformatics; Computer science
  • Degree: Ph.D.
  • Year: 2016
  • Pages: 286 p.
  • Total pages: 286
  • Original format: PDF
  • Language: eng
