Published in: Information Reuse and Integration, 2007 IEEE International Conference on

Extracting Partitional Clusters from Heterogeneous Datasets using Mutual Entropy



Abstract

Clustering has traditionally been used for partitioning the objects of a single dataset. Some applications may require the clustering of multiple related heterogeneous datasets where it may not be easy to compute a useful and effective integrated feature space. In this paper, we present an algorithm called CEMENT (Cluster Ensemble using Mutual ENTropy) to address the problem of clustering two related datasets where the datasets represent the same or overlapping sets of objects but use different feature sets. The algorithm takes the partitional clusters generated from two datasets as input and uses a constraint-based approach to generate a single set of clusters. CEMENT is an EM (expectation maximization) approach where the objective function is the mutual entropy between the two sets of clusters. The algorithm was applied to the problem of clustering a document collection consisting of journal abstracts from ten different Library of Congress categories. These documents were pre-processed using several NLP (natural language processing) steps to extract syntactic and semantic feature sets. We present empirical results and statistical tests showing that CEMENT yields higher quality clusters with this dataset than several baseline clustering approaches.
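To make the objective more concrete, here is a minimal sketch of how agreement between two partitional clusterings of the same objects can be scored. The abstract does not define "mutual entropy", so this assumes it is related to the standard mutual information between two label assignments. It is not the CEMENT algorithm itself, and the function name `mutual_information` is hypothetical.

```python
import math
from collections import Counter


def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two cluster labelings.

    labels_a[i] and labels_b[i] are the cluster ids assigned to object i
    by two clusterings (e.g. one per feature set). This is the standard
    information-theoretic quantity, assumed here as a stand-in for the
    paper's "mutual entropy" objective.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same objects")
    n = len(labels_a)
    if n == 0:
        return 0.0

    count_a = Counter(labels_a)                  # cluster sizes in clustering A
    count_b = Counter(labels_b)                  # cluster sizes in clustering B
    count_ab = Counter(zip(labels_a, labels_b))  # joint co-occurrence counts

    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p(a, b) * log( p(a, b) / (p(a) * p(b)) ), written with raw counts
        mi += (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
    return mi
```

For two balanced clusters, labelings that agree up to renaming, such as `[0, 0, 1, 1]` and `[1, 1, 0, 0]`, score `log 2` (about 0.693 nats), while statistically independent labelings such as `[0, 0, 1, 1]` and `[0, 1, 0, 1]` score 0. An iterative procedure of the kind the abstract describes would adjust cluster assignments to improve such a score, though the paper's exact update rules are not given here.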
