Compression, clustering, and pattern discovery in very high-dimensional discrete-attribute data sets

Koyuturk M.; Grama A.; Ramakrishnan N.


Abstract

This paper presents an efficient framework for error-bounded compression of high-dimensional discrete-attribute data sets. Such data sets, which frequently arise in a wide variety of applications, pose some of the most significant challenges in data analysis. Subsampling and compression are two key technologies for analyzing these data sets. The proposed framework, PROXIMUS, provides a technique for reducing large data sets into a much smaller set of representative patterns, on which traditional (expensive) analysis algorithms can be applied with minimal loss of accuracy. We show desirable properties of PROXIMUS in terms of runtime, scalability to large data sets, and performance in terms of capability to represent data in a compact form and discovery and interpretation of interesting patterns. We also demonstrate sample applications of PROXIMUS in association rule mining and semantic classification of term-document matrices. Our experimental results on real data sets show that use of the compressed data for association rule mining provides excellent precision and recall values (above 90 percent) across a range of problem parameters while reducing the time required for analysis drastically. We also show excellent interpretability of the patterns discovered by PROXIMUS in the context of clustering and classification of terms and documents. In doing so, we establish PROXIMUS as a tool for both preprocessing data before applying computationally expensive algorithms and directly extracting correlated patterns.
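
The precision and recall figures quoted in the abstract can be read as set overlap between the rules mined from the compressed data and the rules mined from the original data. The following is a minimal sketch under that assumption; the function name `precision_recall`, the rule representation as (antecedent, consequent) pairs, and the sample rules are all hypothetical and do not come from the paper.

```python
def precision_recall(reference_rules, compressed_rules):
    """Compare rules mined from compressed data with rules mined from the original data.

    Precision: fraction of rules found on the compressed data that also hold on the original.
    Recall: fraction of the original rules that were recovered from the compressed data.
    """
    reference = set(reference_rules)
    found = set(compressed_rules)
    true_positives = len(reference & found)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical rules, each an (antecedent, consequent) pair of item sets.
original = [
    (frozenset({"bread"}), frozenset({"butter"})),
    (frozenset({"milk"}), frozenset({"bread"})),
    (frozenset({"tea"}), frozenset({"sugar"})),
    (frozenset({"milk", "tea"}), frozenset({"sugar"})),
]
compressed = original[:3]  # assume one rule is lost after compression

precision, recall = precision_recall(original, compressed)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every rule found on the compressed data is also an original rule, while one original rule is missed, so precision is higher than recall.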


Bibliographic record

Source
IEEE Transactions on Knowledge and Data Engineering | 2005, no. 4 | pp. 447-461 | 15 pages
Authors
Koyuturk M.; Grama A.; Ramakrishnan N.
Original format: PDF
Language: English
CLC classification: Computing technology, computer technology
Keywords
data compression; data mining; pattern classification; singular value decomposition; very large databases; association rule mining; data analysis; data classification; data clustering; discrete-attribute data sets; pattern discovery; si;


Similar literature

1. The Clustered Causal State Algorithm: Efficient Pattern Discovery for Lossy Data-Compression Applications [J]. Schmiedekamp M., Subbu A., Phoha S. Computing in Science & Engineering. 2006, no. 5
2. Discovery of Patterns and evaluation of Clustering Algorithms in SocialNetwork Data (Face book 100 Universities) through Data Mining Techniques and Methods [J]. Nancy P., R. Geetha Ramani. International Journal of Data Mining & Knowledge Management Process. 2012, no. 5
3. Interactive Pattern Discovery in High-Dimensional, Multimodal Data Using Manifolds [J]. Jinhong K. Guo, Martin O. Hofmann. Procedia Computer Science. 2017
4. Data-Pattern Discovery Methods for Detection in Nongaussian High-dimensional Data Sets [C]. Cecile Levasseur, Kenneth Kreutz-Delgado, Uwe Mayer. Asilomar Conference on Signals, Systems and Computers. 2006
5. Efficient computation of k-nearest neighbor graphs for large high-dimensional data sets on gpu clusters [D]. Dashti, Ali. 2013
6. Efficient Computation of k-Nearest Neighbour Graphs for Large High-Dimensional Data Sets on GPU Clusters [O]. Ali Dashti, Ivan Komarov, Roshan M. D’Souza
7. Compression, Clustering and Pattern Discovery in Very High Dimensional Discrete-Attribute Datasets [O]. Mehmet Koyutürk, Ananth Grama, Naren Ramakrishnan. 2008
