A Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries



Abstract

We present a novel spectral-based algorithm for clustering categorical data that combines attribute relationship and dimension reduction techniques found in Principal Component Analysis (PCA) and Latent Semantic Indexing (LSI). The new algorithm uses data summaries that consist of attribute occurrence and co-occurrence frequencies to create a set of vectors each of which represents a cluster. We refer to these vectors as "candidate cluster representatives." The algorithm also uses spectral decomposition of the data summaries matrix to project and cluster the data objects in a reduced space. We refer to the algorithm as SCCADDS (Spectral-based Clustering algorithm for CAtegorical Data using Data Summaries). SCCADDS differs from other spectral clustering algorithms in several key respects. First, the algorithm uses the attribute categories similarity matrix instead of the data object similarity matrix (as is the case with most spectral algorithms that find the normalized cut of a graph of nodes of data objects). SCCADDS scales well for large datasets since in most categorical clustering applications the number of attribute categories is small relative to the number of data objects. Second, non-recursive spectral-based clustering algorithms typically require K-means or some other iterative clustering method after the data objects have been projected into a reduced space. SCCADDS clusters the data objects directly by comparing them to candidate cluster representatives without the need for an iterative clustering method. Third, unlike standard spectral-based algorithms, the complexity of SCCADDS is linear in terms of the number of data objects. Results on datasets widely used to test categorical clustering algorithms show that SCCADDS produces clusters that are consistent with those produced by existing algorithms, while avoiding the computation of the spectra of large matrices and problems inherent in methods that employ the K-means type algorithms.
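The abstract's overall idea (summaries over attribute categories, a spectral decomposition of the small category-by-category matrix, and a single non-iterative assignment pass) can be illustrated with a minimal sketch. This is not the SCCADDS procedure from the paper: the PCA-style centering, the use of plus and minus the top eigenvectors as stand-in "candidate cluster representatives", and every function and variable name below are assumptions made for illustration only.

```python
import numpy as np

def one_hot(rows):
    """Encode categorical rows as a 0/1 matrix, one column per attribute category."""
    n_attrs = len(rows[0])
    columns = []
    for j in range(n_attrs):
        for value in sorted({row[j] for row in rows}):
            columns.append([1.0 if row[j] == value else 0.0 for row in rows])
    return np.array(columns).T  # shape: (n_objects, n_categories)

def cluster_by_summaries(rows, k=1):
    """Hypothetical sketch: cluster objects using a category-level summary matrix."""
    X = one_hot(rows)
    # X.T @ X would hold occurrence counts on the diagonal and co-occurrence
    # counts off it. Centering first (as in PCA) is an assumption of this sketch.
    Xc = X - X.mean(axis=0)
    summaries = Xc.T @ Xc  # (n_categories, n_categories), built in one pass over the objects
    # eigh returns eigenvalues in ascending order, so the last k columns are the top ones.
    _, eigvecs = np.linalg.eigh(summaries)
    top = eigvecs[:, -k:]
    # Assumed stand-in for candidate representatives: each top eigenvector and its negation.
    reps = np.vstack([top.T, -top.T])  # (2k, n_categories)
    # Single pass: assign each object to the representative it matches best.
    scores = Xc @ reps.T
    return scores.argmax(axis=1)

rows = [
    ("red", "circle", "small"),
    ("red", "circle", "large"),
    ("red", "circle", "small"),
    ("blue", "square", "large"),
    ("blue", "square", "small"),
    ("blue", "square", "large"),
]
labels = cluster_by_summaries(rows, k=1)
print(labels)
```

In this sketch the eigendecomposition is performed on a matrix whose size depends only on the number of attribute categories, and the objects are touched only when building the summaries and when scoring, so the cost grows linearly with the number of objects. The sign of an eigenvector is arbitrary, so which group receives label 0 may vary.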


Bibliographic record

Source
Workshop on data mining using matrices and tensors 2009 | 2009 | pp. 9-16 | 8 pages
Conference location: Paris (FR)
Authors
Eman Abdu; Douglas Salane
Author affiliations

Computer Science Department The Graduate Center The City University of New York;

Mathematics Computer Science John Jay College of Criminal Justice The City University of New York;

Original format: PDF
Language: English
Chinese Library Classification: TP311.13
Keywords
spectral algorithms; categorical data;


Similar literature


1. A clustering algorithm for multiple data streams based on spectral component similarity [J]. Zou Lingjun, Chen Jun, Tu Li. Journal of Southeast University (English Edition). 2008, No. 3
2. Weighted Delta Factor Cluster Ensemble Algorithm for Categorical Data Clustering in Data Mining [J]. Sengottaian Sarumathi, Natesan Shanthi, Mathivanan Sharmila. The international arab journal of information technology. 2017, No. 3
3. GACC: genetic algorithm-based categorical data clustering for large datasets [J]. Abha Sharma, R.S. Thakur. International journal of data mining, modelling and management. 2017, No. 4
4. Fast Density Clustering Algorithm for Numerical Data and Categorical Data [J]. Chen Jinyin, He Huihao, Chen Jungan. Mathematical Problems in Engineering. 2017
5. Cactus-Clustering Categorical Data Using Summaries [C]. Venkatesh Ganti, Johannes Gehrke, Raghu Ramakrishnan. ACM SIGKDD international conference on knowledge discovery and data mining. 1999
6. Clustering categorical data using data summaries and spectral techniques [D]. Abdu, Eman. 2009
7. Evaluation of Modified Categorical Data Fuzzy Clustering Algorithm on the Wisconsin Breast Cancer Dataset [O]. Amir Ahmad. 2016
8. A Fuzzy Centroids Clustering Algorithm with Between-cluster Information for Categorical Data [O]. Wang Li-Na, Liu Qian, Zhou Yuan. 2013
