Journal: IEEE Transactions on Knowledge and Data Engineering

Compression, clustering, and pattern discovery in very high-dimensional discrete-attribute data sets

Abstract

This paper presents an efficient framework for error-bounded compression of high-dimensional discrete-attribute data sets. Such data sets, which frequently arise in a wide variety of applications, pose some of the most significant challenges in data analysis. Subsampling and compression are two key technologies for analyzing these data sets. The proposed framework, PROXIMUS, provides a technique for reducing large data sets into a much smaller set of representative patterns, on which traditional (expensive) analysis algorithms can be applied with minimal loss of accuracy. We show desirable properties of PROXIMUS in terms of runtime, scalability to large data sets, and performance in terms of capability to represent data in a compact form and discovery and interpretation of interesting patterns. We also demonstrate sample applications of PROXIMUS in association rule mining and semantic classification of term-document matrices. Our experimental results on real data sets show that use of the compressed data for association rule mining provides excellent precision and recall values (above 90 percent) across a range of problem parameters while reducing the time required for analysis drastically. We also show excellent interpretability of the patterns discovered by PROXIMUS in the context of clustering and classification of terms and documents. In doing so, we establish PROXIMUS as a tool for both preprocessing data before applying computationally expensive algorithms and directly extracting correlated patterns.
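
The abstract reports precision and recall for association rules mined from the compressed data but does not say how they were computed. The sketch below assumes the usual definitions, treating the rules mined from the original data as the reference set. The helper names (`rule`, `precision_recall`) and the toy rule sets are hypothetical and are not taken from the paper.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# A rule is modelled as (antecedent, consequent); this representation is an
# assumption made for illustration, not PROXIMUS's own data structure.
Rule = Tuple[FrozenSet[str], FrozenSet[str]]


def rule(antecedent: Iterable[str], consequent: Iterable[str]) -> Rule:
    """Build a hashable rule so that rules can be stored in sets."""
    return (frozenset(antecedent), frozenset(consequent))


def precision_recall(reference: Set[Rule], mined: Set[Rule]) -> Tuple[float, float]:
    """Compare rules mined from compressed data against a reference set.

    precision = |reference ∩ mined| / |mined|
    recall    = |reference ∩ mined| / |reference|
    """
    true_positives = len(reference & mined)
    precision = true_positives / len(mined) if mined else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Toy data (made up): rules found on the original data set ...
reference_rules = {
    rule({"a"}, {"b"}),
    rule({"b"}, {"c"}),
    rule({"a", "c"}, {"d"}),
    rule({"d"}, {"a"}),
}
# ... and rules found on a hypothetical compressed version of it.
compressed_rules = {
    rule({"a"}, {"b"}),
    rule({"b"}, {"c"}),
    rule({"a", "c"}, {"d"}),
    rule({"c"}, {"a"}),
}

precision, recall = precision_recall(reference_rules, compressed_rules)
print(precision, recall)  # 0.75 0.75
```

Under these definitions, a precision above 90 percent would mean that few rules found on the compressed data are spurious, and a recall above 90 percent would mean that few rules from the original data are lost.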
