Interpretable and reconfigurable clustering of document datasets by deriving word-based rules

Vipin Balachandran; Deepak P; Deepak Khemani

Abstract

Clusters of text documents output by clustering algorithms are often hard to interpret. We describe motivating real-world scenarios that necessitate reconfigurability and high interpretability of clusters, and outline the problem of generating clusterings with interpretable and reconfigurable cluster models. We develop two clustering algorithms toward this goal. They generate clusters with associated rules that are composed of conditions on word occurrences or nonoccurrences. The two approaches differ in the complexity of the rule format: RGC employs disjunctions and conjunctions in rule generation, whereas RGC-D rules are simple disjunctions of conditions signifying the presence of various words. In both cases, each cluster comprises precisely the set of documents that satisfy the corresponding rule. Rules of the latter kind are easy to interpret, whereas the former leads to more accurate clustering. We show that our approaches outperform the unsupervised decision tree approach for rule-generating clustering, and also an approach we provide for generating interpretable models for general clusterings, both by significant margins. We empirically show that the purity and F-measure losses incurred to achieve interpretability can be as little as 3% and 5%, respectively, using the algorithms presented herein.
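
To make the idea of a rule-defined cluster concrete, the following is a minimal sketch, not the authors' RGC or RGC-D algorithm. It assumes that the rules are already given (the abstract does not say how they are derived) and only shows how a disjunctive word-presence rule, of the kind described for RGC-D, determines cluster membership, and how purity could then be measured. The function names, the whitespace tokenizer, the toy documents, and the purity normalisation (by total assigned documents) are all illustrative assumptions.

```python
from collections import Counter
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Hypothetical tokenizer: lowercase and split on whitespace."""
    return set(text.lower().split())


def satisfies_disjunction(words: Set[str], rule_words: Set[str]) -> bool:
    """A disjunctive rule holds if at least one rule word occurs in the document."""
    return not words.isdisjoint(rule_words)


def cluster_by_rules(docs: List[str], rules: Dict[str, Set[str]]) -> Dict[str, List[int]]:
    """Each cluster is exactly the set of document indices satisfying its rule."""
    clusters: Dict[str, List[int]] = {name: [] for name in rules}
    for idx, text in enumerate(docs):
        words = tokenize(text)
        for name, rule_words in rules.items():
            if satisfies_disjunction(words, rule_words):
                clusters[name].append(idx)
    return clusters


def purity(clusters: Dict[str, List[int]], labels: List[str]) -> float:
    """Share of assigned documents that carry their cluster's majority label.

    Assumption: normalised by the number of cluster assignments, since a
    rule-defined clustering may leave documents unassigned or overlapping.
    """
    assigned = sum(len(members) for members in clusters.values())
    if assigned == 0:
        return 0.0
    majority = sum(
        max(Counter(labels[i] for i in members).values())
        for members in clusters.values()
        if members
    )
    return majority / assigned


# Toy data, invented for illustration only.
docs = [
    "the team won the football match",
    "the election results favour the senate",
    "football league players strike",
    "senate vote on election reform",
]
rules = {
    "sports": {"football", "league"},
    "politics": {"election", "senate"},
}
labels = ["sports", "politics", "sports", "politics"]

clusters = cluster_by_rules(docs, rules)
score = purity(clusters, labels)
```

Because membership is decided by the rule alone, editing a rule's word set (for example, adding a word to `rules["sports"]`) directly reconfigures the cluster. That is one plausible reading of the reconfigurability the abstract refers to.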

Bibliographic details

Source: Knowledge and Information Systems, 2012, Issue 3, pp. 475-503 (29 pages)
Authors: Vipin Balachandran; Deepak P; Deepak Khemani
Affiliations: VMware, Bangalore, India; IBM Research, Bangalore, India; IIT Madras, Chennai, India
Original format: PDF
Language: English
Keywords: Data clustering; Text clustering; Interpretability

Similar literature

1. Interpretable and reconfigurable clustering of document datasets by deriving word-based rules [J]. Vipin Balachandran, Deepak P., Deepak Khemani. Knowledge and Information Systems, 2012, Issue 3.
2. A hybrid reciprocal model of PCA and K-means with an innovative approach of considering sub-datasets for the improvement of K-means initialization and step-by-step labeling to create clusters with high interpretability [J]. Anaraki Seyed Alireza Mousavian, Haeri Abdorrahman, Moslehi Fateme. Pattern Analysis and Applications, 2021, Issue 3.
3. An interpretable framework for clustering single-cell RNA-Seq datasets [J]. Jesse M. Zhang, Jue Fan, H. Christina Fan. BMC Bioinformatics, 2018, Issue 1.
4. Interpretable and reconfigurable clustering of document datasets by deriving word-based rules [C]. Vipin Balachandran, Deepak P, Deepak Khemani. 18th ACM Conference on Information and Knowledge Management, 2009.
5. Supervised precision ordinal clustering – A human-machine learning algorithm to create accurate clusters in big datasets: Application to Indiana water quality data with novel visualization techniques [D]. Singh, Sarabjit. 2014.
6. An interpretable framework for clustering single-cell RNA-Seq datasets [O]. Jesse M. Zhang, Jue Fan, H. Christina Fan. 2018.
7. Interpretable and Reconfigurable Clustering of Document Datasets by Deriving Word-based Rules [O]. Balachandran, Vipin; Padmanabhan, Deepak; Khemani, Deepak. 2012.
