Journal: Information Systems

Distributed clustering of categorical data using the information bottleneck framework


Abstract

We perform clustering of categorical data at large scale using the Information Bottleneck (IB) framework. We examine the performance of existing solutions on multiple machine architectures. The IB method uses information theory to recast database relations as probability distributions, and the proximity of their tuples as the loss of information incurred when they are considered together. More precisely, we study the Agglomerative Information Bottleneck, the Sequential Information Bottleneck, and LIMBO, a newer approach that uses summaries of the original data. First, we evaluate the performance and limitations of these algorithms when confronted with large datasets on a single, powerful machine. We then propose new implementations that take advantage of distributed environments. Finally, we evaluate their effectiveness and efficiency using real and large synthetic datasets tens of gigabytes in size. © 2017 Elsevier Ltd. All rights reserved.
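To make the idea of "proximity as loss of information" concrete, here is a minimal sketch in Python. It is not the paper's implementation. It assumes the representation commonly used in the IB clustering literature: each tuple t has prior p(t) = 1/n and spreads its conditional mass p(a|t) uniformly over the attribute values it contains. The merge cost is the standard agglomerative IB quantity, (p(c_i) + p(c_j)) times the weighted Jensen-Shannon divergence between the two conditional distributions. The function names `tuple_distribution`, `kl`, and `merge_cost`, and the toy rows, are hypothetical.

```python
from math import log2

def tuple_distribution(row):
    """p(a | t): uniform mass over the (attribute index, value) pairs in the tuple.

    Assumption: every attribute value of a tuple is equally likely given the tuple.
    """
    m = len(row)
    return {(i, v): 1.0 / m for i, v in enumerate(row)}

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; q must cover the support of p."""
    return sum(pv * log2(pv / q[k]) for k, pv in p.items() if pv > 0)

def merge_cost(p_ci, dist_i, p_cj, dist_j):
    """Information lost by merging clusters c_i and c_j, plus the merged distribution.

    cost = (p(c_i) + p(c_j)) * JS(p(A | c_i), p(A | c_j)),
    where the Jensen-Shannon weights are the clusters' relative masses.
    """
    total = p_ci + p_cj
    wi, wj = p_ci / total, p_cj / total
    keys = set(dist_i) | set(dist_j)
    merged = {k: wi * dist_i.get(k, 0.0) + wj * dist_j.get(k, 0.0) for k in keys}
    js = wi * kl(dist_i, merged) + wj * kl(dist_j, merged)
    return total * js, merged

# Toy relation with two categorical attributes (illustrative data only).
rows = [("red", "small"), ("red", "small"), ("blue", "large")]
n = len(rows)
dists = [tuple_distribution(r) for r in rows]

same_cost, _ = merge_cost(1 / n, dists[0], 1 / n, dists[1])  # identical tuples
diff_cost, _ = merge_cost(1 / n, dists[0], 1 / n, dists[2])  # no shared values
print(same_cost, diff_cost)
```

Merging the two identical tuples loses no information, while merging tuples that share no attribute values loses the most. An agglomerative scheme would repeatedly merge the pair with the smallest cost; summary-based and distributed variants differ in how they limit the number of pairs that must be compared.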
