Conference: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; KDD 10

Redefining Class Definitions using Constraint-Based Clustering

Abstract

Two aspects are crucial when constructing any real-world supervised classification task: the set of classes whose distinction might be useful for the domain expert, and the set of classifications that can actually be distinguished by the data. Often a set of labels is defined with some initial intuition, but these labels are not the best match for the task. For example, labels have been assigned for land cover classification of the Earth, but it has been suspected that these labels are not ideal: some classes may be best split into subclasses, whereas others should be merged. This paper formalizes this problem using three ingredients: the existing class labels, the underlying separability in the data, and a special type of input from the domain expert. We require a domain expert to specify an L×L matrix of pairwise probabilistic constraints expressing their beliefs as to whether the L classes should be kept separate, merged, or split. This type of input is intuitive and easy for experts to supply. We then show that the problem can be solved by casting it as an instance of penalized probabilistic clustering (PPC). Our method, Class-Level PPC (CPPC), extends PPC, showing how its time complexity can be reduced from O(N^2) to O(NL) for the problem of class re-definition. We further extend the algorithm by presenting a heuristic to measure adherence to constraints, and by providing a criterion for determining the model complexity (number of classes) for constraint-based clustering. We demonstrate and evaluate CPPC on artificial data and on our motivating domain of land cover classification. For the latter, an evaluation by domain experts shows that the algorithm discovers novel class definitions that are better suited to land cover classification than the original set of labels.
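To give some intuition for why class-level constraints can avoid enumerating every pair of instances, the following is a minimal sketch and not the CPPC algorithm itself. It assumes a hypothetical L×L weight matrix `w` (standing in for the expert's pairwise beliefs), a soft cluster assignment `q[i]` for each instance, and a simple pairwise agreement score; all names and the scoring function are illustrative assumptions, not taken from the paper. The point it shows is that when a pair's weight depends only on the two class labels, the sum over all instance pairs can be computed from per-class sums.

```python
import random


def pairwise_score_naive(labels, q, w):
    """Sum w[y_i][y_j] * <q_i, q_j> over all ordered pairs (i, j).

    Enumerates every pair of instances, so the cost grows with N^2.
    """
    total = 0.0
    for yi, qi in zip(labels, q):
        for yj, qj in zip(labels, q):
            total += w[yi][yj] * sum(a * b for a, b in zip(qi, qj))
    return total


def pairwise_score_by_class(labels, q, w):
    """Same quantity, computed from per-class sums of the assignments.

    One pass over the N instances, then a loop over the L x L class pairs.
    """
    num_classes = len(w)
    num_clusters = len(q[0])
    # class_sums[l][k] = sum of q[i][k] over instances i with label l
    class_sums = [[0.0] * num_clusters for _ in range(num_classes)]
    for yi, qi in zip(labels, q):
        for k, value in enumerate(qi):
            class_sums[yi][k] += value
    total = 0.0
    for l in range(num_classes):
        for m in range(num_classes):
            dot = sum(a * b for a, b in zip(class_sums[l], class_sums[m]))
            total += w[l][m] * dot
    return total


def make_example(n=60, num_classes=3, num_clusters=4, seed=0):
    """Build random labels, soft assignments, and a hypothetical weight matrix."""
    rng = random.Random(seed)
    labels = [rng.randrange(num_classes) for _ in range(n)]
    q = []
    for _ in range(n):
        raw = [rng.random() for _ in range(num_clusters)]
        norm = sum(raw)
        q.append([r / norm for r in raw])  # each row sums to 1
    w = [[rng.uniform(-1.0, 1.0) for _ in range(num_classes)]
         for _ in range(num_classes)]
    return labels, q, w


labels, q, w = make_example()
naive = pairwise_score_naive(labels, q, w)
fast = pairwise_score_by_class(labels, q, w)
print(naive, fast)
```

Both functions return the same value up to floating-point rounding; only the second avoids the double loop over instances. How CPPC actually achieves its stated O(NL) complexity is described in the paper, not here.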
