Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence

A mixture model and EM-based algorithm for class discovery, robust classification, and outlier rejection in mixed labeled/unlabeled data sets



Abstract

Several authors have shown that, when labeled data are scarce, improved classifiers can be built by augmenting the training set with a large set of unlabeled examples and then performing suitable learning. These works assume each unlabeled sample originates from one of the (known) classes. Here, we assume each unlabeled sample comes from either a known class or a heretofore undiscovered class. We propose a novel mixture model which treats as observed data not only the feature vector and the class label, but also the fact of label presence/absence for each sample. Two types of mixture components are posited. "Predefined" components generate data from known classes and assume class labels are missing at random. "Nonpredefined" components only generate unlabeled data, i.e., they capture exclusively unlabeled subsets, consistent with an outlier distribution or new classes. The predefined/nonpredefined natures are data-driven, learned along with the other parameters via an extension of the EM algorithm. Our modeling framework addresses problems involving both the known and unknown classes: (1) robust classifier design, (2) classification with rejections, and (3) identification of the unlabeled samples (and their components) from unknown classes. Problem (3) is a step toward new class discovery. Experiments are reported for each application, including topic discovery for the Reuters domain. Experiments also demonstrate the value of label presence/absence data in learning accurate mixtures.
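To make the role of label presence/absence concrete, the following is a minimal sketch, not the authors' algorithm. It assumes spherical Gaussian components with a fixed shared variance, and it gives every component its own learned probability of emitting a label (`p_lab`), which simplifies the missing-at-random assumption stated above. A component whose `p_lab` collapses toward zero behaves like a "nonpredefined" component. The function name `em_label_presence`, its parameters, and the threshold in the usage note are all hypothetical.

```python
import numpy as np

def em_label_presence(X, y, n_classes, init_means, n_iter=50, var=1.0):
    """EM for a mixture that also models whether each sample is labeled.

    X: (n, d) features; y: (n,) integer class labels, -1 meaning unlabeled.
    Assumption: spherical Gaussians with fixed shared variance `var`.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    means = np.array(init_means, dtype=float)
    K = means.shape[0]
    alpha = np.full(K, 1.0 / K)                      # mixing weights
    p_lab = np.full(K, 0.5)                          # P(label present | component)
    beta = np.full((K, n_classes), 1.0 / n_classes)  # P(class | component, labeled)
    labeled = y >= 0
    eps = 1e-9
    for _ in range(n_iter):
        # E-step: log of the joint over (features, label presence, class).
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        log_r = np.log(alpha + eps) - 0.5 * sq / var
        log_r[labeled] += np.log(p_lab + eps) + np.log(beta[:, y[labeled]].T + eps)
        log_r[~labeled] += np.log(1.0 - p_lab + eps)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)            # responsibilities, shape (n, K)
        # M-step: responsibility-weighted re-estimates.
        nk = r.sum(axis=0) + eps
        alpha = nk / nk.sum()
        means = (r.T @ X) / nk[:, None]
        p_lab = r[labeled].sum(axis=0) / nk
        for c in range(n_classes):
            beta[:, c] = r[labeled & (y == c)].sum(axis=0)
        beta = (beta + eps) / (beta + eps).sum(axis=1, keepdims=True)
    return alpha, means, p_lab, beta, r
```

In this sketch, a component could be flagged as nonpredefined with a rule such as `p_lab < 0.05` (an arbitrary illustrative threshold), and unlabeled samples whose responsibility is concentrated on such components would be candidates for outliers or new classes.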
