Workshop on Data Mining using Matrices and Tensors 2009
A Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries

Abstract

We present a novel spectral-based algorithm for clustering categorical data that combines the attribute-relationship and dimension-reduction techniques found in Principal Component Analysis (PCA) and Latent Semantic Indexing (LSI). The new algorithm uses data summaries, which consist of attribute occurrence and co-occurrence frequencies, to create a set of vectors, each of which represents a cluster. We refer to these vectors as "candidate cluster representatives." The algorithm also uses the spectral decomposition of the data summaries matrix to project and cluster the data objects in a reduced space. We refer to the algorithm as SCCADDS (Spectral-based Clustering algorithm for CAtegorical Data using Data Summaries). SCCADDS differs from other spectral clustering algorithms in several key respects. First, the algorithm uses the similarity matrix of the attribute categories rather than the similarity matrix of the data objects, which is what most spectral algorithms use when they find the normalized cut of a graph whose nodes are data objects. SCCADDS therefore scales well to large datasets, since in most categorical clustering applications the number of attribute categories is small relative to the number of data objects. Second, non-recursive spectral-based clustering algorithms typically require K-means or some other iterative clustering method after the data objects have been projected into a reduced space. SCCADDS clusters the data objects directly by comparing them to the candidate cluster representatives, with no need for an iterative clustering method. Third, unlike standard spectral-based algorithms, the complexity of SCCADDS is linear in the number of data objects. Results on datasets widely used to test categorical clustering algorithms show that SCCADDS produces clusters consistent with those produced by existing algorithms, while avoiding both the computation of the spectra of large matrices and the problems inherent in methods that employ K-means-type algorithms.
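The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' SCCADDS implementation: the abstract does not specify how the candidate cluster representatives are built, so the sketch assumes a PCA-style centring of the summaries and assumes that each leading eigenvector yields two representatives (its positively and its negatively loaded categories). The function names `one_hot` and `cluster_with_summaries` and the toy `records` are hypothetical.

```python
import numpy as np


def one_hot(records):
    """Encode categorical records as a binary object-by-category matrix."""
    n_attrs = len(records[0])
    # One column per distinct (attribute index, value) pair.
    categories = sorted({(j, row[j]) for row in records for j in range(n_attrs)})
    column = {cat: k for k, cat in enumerate(categories)}
    X = np.zeros((len(records), len(categories)))
    for i, row in enumerate(records):
        for j, value in enumerate(row):
            X[i, column[(j, value)]] = 1.0
    return X, categories


def cluster_with_summaries(records, n_components=1):
    """Assign each record to one of 2 * n_components candidate representatives."""
    X, _ = one_hot(records)
    n = X.shape[0]

    # Data summaries: the diagonal of S holds category occurrence counts,
    # the off-diagonal entries hold pairwise co-occurrence counts.
    S = X.T @ X
    mu = np.diag(S) / n

    # Assumption: centre the summaries PCA-style. The covariance is computed
    # from S and mu alone, so this step does not revisit the data objects.
    C = S / n - np.outer(mu, mu)

    # Spectral decomposition of the small (categories x categories) matrix.
    # eigh returns eigenvalues in ascending order, so take the last columns.
    _, vecs = np.linalg.eigh(C)
    V = vecs[:, -n_components:]

    # Assumption: each leading eigenvector yields two candidate representatives,
    # the categories with positive loadings and those with negative loadings.
    reps = np.vstack([(V > 0).T, (V < 0).T]).astype(float)

    # Project objects and representatives into the reduced space, then assign
    # each object to its nearest representative in a single pass (no K-means).
    Z = (X - mu) @ V
    R = (reps - mu) @ V
    dists = np.linalg.norm(Z[:, None, :] - R[None, :, :], axis=2)
    return dists.argmin(axis=1)


# Toy data (hypothetical): two groups that differ on the first two attributes.
records = [
    ("a", "x", "p"), ("a", "x", "p"), ("a", "x", "q"),
    ("b", "y", "q"), ("b", "y", "q"), ("b", "y", "p"),
]
labels = cluster_with_summaries(records)
```

In this sketch the eigendecomposition operates on a matrix whose size depends only on the number of attribute categories, and the data objects are touched only when building the summaries and in the final single-pass assignment, which mirrors the linear-in-objects behaviour the abstract claims. The numeric cluster labels are arbitrary, because the sign of an eigenvector is not determined.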
