Journal: Knowledge and Information Systems

Interpretable and reconfigurable clustering of document datasets by deriving word-based rules

Abstract

Clusters of text documents output by clustering algorithms are often hard to interpret. We describe motivating real-world scenarios that necessitate reconfigurability and high interpretability of clusters, and we outline the problem of generating clusterings with interpretable and reconfigurable cluster models. We develop two clustering algorithms toward this goal. They generate clusters with associated rules that are composed of conditions on word occurrences or nonoccurrences. The proposed approaches vary in the complexity of the rule format: RGC employs disjunctions and conjunctions in rule generation, whereas RGC-D rules are simple disjunctions of conditions signifying the presence of various words. In both cases, each cluster comprises precisely the set of documents that satisfy the corresponding rule. Rules of the latter kind are easy to interpret, whereas the former lead to more accurate clustering. We show that our approaches outperform, by significant margins, both the unsupervised decision tree approach for rule-generating clustering and an approach we provide for generating interpretable models for general clusterings. We empirically show that the purity and F-measure losses incurred to achieve interpretability can be as little as 3% and 5%, respectively, using the algorithms presented herein.
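
To make the rule semantics in the abstract concrete, the following minimal Python sketch shows how a word-based rule can define a cluster as exactly the set of documents that satisfy it, and how purity can then be computed against reference labels. It does not reproduce how RGC or RGC-D derive their rules, which the abstract does not specify. The documents, labels, rules, and all function names are hypothetical, and representing the richer rule type as a disjunction of conjunctions is an assumption made only for illustration.

```python
from collections import Counter

def tokens(doc):
    """Lowercased set of whitespace-separated words in a document."""
    return set(doc.lower().split())

def satisfies_disjunction(doc_tokens, words):
    """Simple disjunctive rule: true if any listed word is present."""
    return any(w in doc_tokens for w in words)

def satisfies_dnf(doc_tokens, clauses):
    """Assumed richer rule form: a disjunction of conjunctions.
    Each literal is (word, present); present=False means the word must be absent."""
    return any(
        all((w in doc_tokens) == present for w, present in clause)
        for clause in clauses
    )

def cluster_by_rules(docs, rules):
    """Each cluster is exactly the set of document indices satisfying its rule."""
    return {
        name: [i for i, d in enumerate(docs) if satisfies_disjunction(tokens(d), words)]
        for name, words in rules.items()
    }

def purity(clusters, labels):
    """Sum of majority-label counts per cluster, divided by clustered documents."""
    total = sum(len(members) for members in clusters.values())
    majority = sum(
        Counter(labels[i] for i in members).most_common(1)[0][1]
        for members in clusters.values() if members
    )
    return majority / total

# Hypothetical toy data, not taken from the paper.
docs = [
    "the striker scored a late goal",
    "the keeper saved a penalty",
    "parliament passed the budget vote",
    "the minister lost the vote after the match",
]
labels = ["sport", "sport", "politics", "politics"]

# Hand-written disjunctive rules (assumed for illustration, not learned).
rules = {"sport": ["goal", "penalty", "match"], "politics": ["parliament", "budget"]}
clusters = cluster_by_rules(docs, rules)
print(clusters)                  # {'sport': [0, 1, 3], 'politics': [2]}
print(purity(clusters, labels))  # 0.75

# A rule with conjunction and a nonoccurrence condition:
# (vote AND NOT match) OR parliament
politics_dnf = [[("vote", True), ("match", False)], [("parliament", True)]]
print([satisfies_dnf(tokens(d), politics_dnf) for d in docs])
```

Because a rule fully determines cluster membership, editing a rule (for example, removing "match" from the hypothetical sport rule) directly reconfigures the cluster, which is the kind of reconfigurability the abstract motivates.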