
LIMBO: Scalable Clustering of Categorical Data


Abstract

Clustering is a problem of great practical importance in numerous applications. The problem of clustering becomes more challenging when the data is categorical, that is, when there is no inherent distance measure between data values. We introduce LIMBO, a scalable hierarchical categorical clustering algorithm that builds on the Information Bottleneck (IB) framework for quantifying the relevant information preserved when clustering. As a hierarchical algorithm, LIMBO has the advantage that it can produce clusterings of different sizes in a single execution. We use the IB framework to define a distance measure for categorical tuples, and we also present a novel distance measure for categorical attribute values. We show how the LIMBO algorithm can be used to cluster both tuples and values. LIMBO handles large data sets by producing a memory-bounded summary model for the data. We present an experimental evaluation of LIMBO, and we study how its clustering quality compares to that of other categorical clustering algorithms. LIMBO supports a trade-off between efficiency (in terms of space and time) and quality. We quantify this trade-off and demonstrate that LIMBO allows for substantial improvements in efficiency with a negligible decrease in quality.
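The abstract does not spell out the IB-based distance between categorical tuples. As a rough illustration only, the sketch below uses the standard agglomerative Information Bottleneck merge cost: the information lost by merging two clusters is their combined probability mass times the weighted Jensen-Shannon divergence of their conditional distributions. The function names (`tuple_distribution`, `merge_cost`), the uniform distribution over a tuple's attribute values, and the sample rows are assumptions made for this example, not details taken from the paper.

```python
import math


def tuple_distribution(row):
    """Assumed representation: a tuple is a uniform distribution over
    its (attribute, value) pairs."""
    items = list(row.items())
    return {item: 1.0 / len(items) for item in items}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.
    q must be non-zero wherever p is non-zero."""
    return sum(pv * math.log2(pv / q[k]) for k, pv in p.items() if pv > 0)


def merge_cost(mass1, dist1, mass2, dist2):
    """Information loss from merging two clusters (agglomerative IB):
    (mass1 + mass2) * JS divergence, weighted by relative cluster mass."""
    total = mass1 + mass2
    w1, w2 = mass1 / total, mass2 / total
    keys = set(dist1) | set(dist2)
    # Distribution of the merged cluster: mass-weighted mixture.
    merged = {k: w1 * dist1.get(k, 0.0) + w2 * dist2.get(k, 0.0) for k in keys}
    js = w1 * kl_divergence(dist1, merged) + w2 * kl_divergence(dist2, merged)
    return total * js


# Hypothetical data: three tuples, each carrying probability mass 1/3.
rows = [
    {"director": "Scorsese", "genre": "crime"},
    {"director": "Scorsese", "genre": "crime"},
    {"director": "Hitchcock", "genre": "thriller"},
]
mass = 1.0 / len(rows)
dists = [tuple_distribution(r) for r in rows]

same = merge_cost(mass, dists[0], mass, dists[1])      # identical tuples
different = merge_cost(mass, dists[0], mass, dists[2])  # no shared values
print(same, different)
```

Under these assumptions, merging identical tuples loses no information, while merging tuples that share no attribute values is the most costly merge for a given mass. A hierarchical procedure would repeatedly merge the pair with the smallest cost.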
