In several application domains, high-dimensional observations are collected and then analysed in search of naturally occurring data clusters which might provide further insights about the nature of the problem. In this paper we describe a new approach for partitioning such high-dimensional data. Our assumption is that, within each cluster, the data can be approximated well by a linear subspace estimated by means of a principal component analysis (PCA). The proposed algorithm, Predictive Subspace Clustering (PSC), partitions the data into clusters while simultaneously estimating cluster-wise PCA parameters. The algorithm minimises an objective function that depends upon a new measure of influence for PCA models. A penalised version of the algorithm is also described for carrying out simultaneous subspace clustering and variable selection. The convergence of PSC is discussed in detail, and extensive simulation results and comparisons to competing methods are presented. The comparative performance of PSC has been assessed on six real gene expression data sets, for which PSC often provides state-of-the-art results.