
Semi-supervised clustering: Probabilistic models, algorithms and experiments.


Abstract

Clustering is one of the most common data mining tasks, used frequently for data categorization and analysis in both industry and academia. The focus of our research is semi-supervised clustering, where we study how prior knowledge, gathered either from automated information sources or from human supervision, can be incorporated into clustering algorithms. In this thesis, we present probabilistic models for semi-supervised clustering, develop algorithms based on these models, and empirically validate their performance through extensive experiments on data sets from different domains, e.g., text analysis, hand-written character recognition, and bioinformatics.

In many domains where clustering is applied, some prior knowledge is available either in the form of labeled data (specifying the category to which an instance belongs) or pairwise constraints on some of the instances (specifying whether two instances should be in the same cluster or in different clusters). In this thesis, we first analyze effective methods of incorporating labeled supervision into prototype-based clustering algorithms, and propose two variants of the well-known KMeans algorithm that improve its performance with limited labeled data.

We then focus on the problem of semi-supervised clustering with constraints and show how this problem can be studied in the framework of a well-defined probabilistic generative model, the Hidden Markov Random Field (HMRF). We derive an efficient KMeans-type iterative algorithm, HMRF-KMeans, for optimizing a semi-supervised clustering objective function defined on the HMRF model. We also give convergence guarantees for our algorithm for a large class of clustering distortion measures (e.g., squared Euclidean distance, KL divergence, and cosine distance).

Finally, we develop an active learning algorithm for acquiring maximally informative pairwise constraints in an interactive query-driven framework, which to our knowledge is the first active learning algorithm for semi-supervised clustering with constraints.

Other interesting problems of semi-supervised clustering that we discuss in this thesis include (1) semi-supervised graph-based clustering using kernels, (2) using prior knowledge to improve overlapping clustering of data, (3) integration of both constraint-based and distance-based semi-supervised clustering methods using the HMRF model, and (4) model selection techniques that use the available supervision to automatically select the right number of clusters.
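The abstract does not spell out the two KMeans variants, so the following is only a minimal sketch of one plausible way labeled data could be folded into a prototype-based algorithm: the labeled points initialize the centroids, and an optional flag keeps their labels fixed during reassignment. The function name `seeded_kmeans`, its parameters, and the `clamp_seeds` flag are illustrative assumptions, not the thesis's actual algorithms or API.

```python
import numpy as np

def seeded_kmeans(X, seed_idx, seed_labels, k, clamp_seeds=False, n_iter=100):
    """Hypothetical sketch: k-means initialized from labeled seed points.

    X: (n, d) data array. seed_idx: indices of the labeled points.
    seed_labels: their cluster ids in range(k).
    Assumes every cluster has at least one labeled seed.
    """
    X = np.asarray(X, dtype=float)
    seed_idx = np.asarray(seed_idx)
    seed_labels = np.asarray(seed_labels)
    # Initialize each centroid as the mean of the seeds labeled with that cluster.
    centroids = np.stack(
        [X[seed_idx[seed_labels == j]].mean(axis=0) for j in range(k)]
    )
    assign = np.full(len(X), -1)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assign = dists.argmin(axis=1)
        if clamp_seeds:
            # Assumed variant: labeled points never leave their given cluster.
            new_assign[seed_idx] = seed_labels
        if np.array_equal(new_assign, assign):
            break  # assignments are stable, so the centroids are too
        assign = new_assign
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return assign, centroids

# Toy usage: two well-separated groups, one labeled seed per cluster.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
assign, centroids = seeded_kmeans(X, seed_idx=[0, 3], seed_labels=[0, 1], k=2)
```

With `clamp_seeds=False` the seeds only influence initialization and may later be reassigned; with `clamp_seeds=True` they act as hard labels. Which behavior is preferable would likely depend on how much the labels can be trusted.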
