Semi-supervised clustering: Probabilistic models, algorithms and experiments.

Abstract

Clustering is one of the most common data mining tasks, used frequently for data categorization and analysis in both industry and academia. The focus of our research is on semi-supervised clustering, where we study how prior knowledge, gathered either from automated information sources or human supervision, can be incorporated into clustering algorithms. In this thesis, we present probabilistic models for semi-supervised clustering, develop algorithms based on these models, and empirically validate their performance through extensive experiments on data sets from different domains, e.g., text analysis, handwritten character recognition, and bioinformatics.

In many domains where clustering is applied, some prior knowledge is available either in the form of labeled data (specifying the category to which an instance belongs) or pairwise constraints on some of the instances (specifying whether two instances should be in the same or different clusters). In this thesis, we first analyze effective methods of incorporating labeled supervision into prototype-based clustering algorithms, and propose two variants of the well-known KMeans algorithm that can improve clustering performance with limited labeled data.

We then focus on the problem of semi-supervised clustering with constraints and show how this problem can be studied in the framework of a well-defined probabilistic generative model, the Hidden Markov Random Field (HMRF). We derive an efficient KMeans-type iterative algorithm, HMRF-KMeans, for optimizing a semi-supervised clustering objective function defined on the HMRF model. We also give convergence guarantees for our algorithm for a large class of clustering distortion measures (e.g., squared Euclidean distance, KL divergence, and cosine distance).

Finally, we develop an active learning algorithm for acquiring maximally informative pairwise constraints in an interactive query-driven framework, which to our knowledge is the first active learning algorithm for semi-supervised clustering with constraints.

Other interesting problems of semi-supervised clustering that we discuss in this thesis include (1) semi-supervised graph-based clustering using kernels, (2) using prior knowledge to improve overlapping clustering of data, (3) integration of both constraint-based and distance-based semi-supervised clustering methods using the HMRF model, and (4) model selection techniques that use the available supervision to automatically select the right number of clusters.
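The abstract names two kinds of supervision, labeled instances and pairwise constraints, without giving details. The sketch below is one minimal way to illustrate both ideas; it is not taken from the thesis. It assumes squared Euclidean distance, at least one labeled seed per cluster, and a single fixed penalty weight `w` for every violated constraint. The names `seeded_kmeans`, `constrained_objective`, `must_link`, `cannot_link`, and `w` are hypothetical, and the penalized objective is a simplified stand-in rather than the HMRF-KMeans objective itself.

```python
import numpy as np


def seeded_kmeans(X, seed_idx, seed_labels, k, n_iter=100):
    """KMeans whose initial centroids come from labeled seed points.

    Assumption: every cluster 0..k-1 has at least one labeled seed.
    """
    X = np.asarray(X, dtype=float)
    seed_idx = np.asarray(seed_idx)
    seed_labels = np.asarray(seed_labels)
    # Initial centroid j is the mean of the seed points labeled j.
    centroids = np.stack(
        [X[seed_idx[seed_labels == j]].mean(axis=0) for j in range(k)]
    )
    assign = np.full(len(X), -1)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assign = d2.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break  # assignments stopped changing
        assign = new_assign
        for j in range(k):
            members = X[assign == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return assign, centroids


def constrained_objective(X, assign, centroids, must_link, cannot_link, w=1.0):
    """Distortion plus a fixed penalty w per violated pairwise constraint."""
    X = np.asarray(X, dtype=float)
    distortion = ((X - centroids[assign]) ** 2).sum()
    # A must-link pair is violated when its two points land in different clusters.
    ml_penalty = sum(w for i, j in must_link if assign[i] != assign[j])
    # A cannot-link pair is violated when its two points share a cluster.
    cl_penalty = sum(w for i, j in cannot_link if assign[i] == assign[j])
    return distortion + ml_penalty + cl_penalty


# Toy data: two well-separated groups, one labeled seed per group.
X_demo = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], float)
assign_demo, centroids_demo = seeded_kmeans(X_demo, [0, 3], [0, 1], k=2)
score_demo = constrained_objective(
    X_demo, assign_demo, centroids_demo, must_link=[(0, 1)], cannot_link=[(0, 3)]
)
```

In this sketch the labeled points only steer the initialization, and the constraints only enter through the score. A constraint-aware algorithm would also consult the penalties when reassigning points.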

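Bibliographic record

Author: Basu, Sugato
Author affiliation: The University of Texas at Austin
Degree-granting institution: The University of Texas at Austin
Subject: Engineering, Electronics and Electrical; Computer Science
Degree: Ph.D.
Year: 2005
Pages: 174
Original format: PDF
Language: English
Chinese Library Classification: Radio electronics, telecommunications technology; Automation technology, computer technology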

