High-dimensional data often contain nonlinear relations among attributes, which cause problems in cluster analysis such as uneven data distribution, failure of traditional similarity measures, and difficulty in accurately characterizing the resulting cluster centers. To address these problems, a clustering algorithm for high-dimensional nonlinear feature data is proposed based on kernel principal component analysis (KPCA) and density-based clustering (DBSCAN). First, to extract the nonlinear characteristics of the high-dimensional data, KPCA is used to map the original data into a higher-dimensional space, where principal component analysis yields a set of directions of data variation that define a principal component space (PCS) and allow dimensionality reduction. Then, the traditional DBSCAN algorithm is improved by redefining the similarity distance between data samples in the PCS, and 3δ statistical theory is used to characterize the center of each cluster. A case study on clustering a group of hypertension patients illustrates the feasibility of the method. The results show that the proposed method can effectively extract the nonlinear characteristics of the original high-dimensional data, partition the patients into groups by individual characteristics, and express knowledge about the cluster centers, thereby addressing the inapplicability of the traditional DBSCAN method to high-dimensional data.
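The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the kernel, the redefined similarity distance, or the exact form of the 3δ characterization. The sketch therefore assumes an RBF kernel, uses plain Euclidean distance in the projected space as a stand-in for the redefined distance, and reads the 3δ characterization as the per-attribute interval of mean plus or minus three standard deviations. The function names, the parameters (`gamma`, `eps`, `min_pts`), and the synthetic data are all hypothetical.

```python
import numpy as np

def rbf_kernel_pca(X, n_components=2, gamma=0.5):
    """Project samples onto the leading kernel principal components (RBF kernel assumed)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-gamma * d2)
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one   # centre the data in feature space
    vals, vecs = np.linalg.eigh(Kc)              # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:n_components]  # keep the largest ones
    vals, vecs = vals[idx], vecs[:, idx]
    return vecs * np.sqrt(np.maximum(vals, 0.0)) # coordinates of the training samples

def dbscan(Z, eps, min_pts):
    """Basic DBSCAN with Euclidean distance (a stand-in for the paper's distance); -1 marks noise."""
    n = len(Z)
    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_pts:
            continue                             # not a core point; stays noise unless reached later
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster              # core or border point joins the cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:
                    queue.extend(neighbors[j])   # expand only through core points
        cluster += 1
    return labels

def three_sigma_summary(X, labels):
    """Per cluster: (mean, mean - 3*std, mean + 3*std) on the original attributes (assumed reading of 3δ)."""
    summary = {}
    for c in sorted(set(labels.tolist()) - {-1}):
        pts = X[labels == c]
        mu, sd = pts.mean(axis=0), pts.std(axis=0)
        summary[c] = (mu, mu - 3.0 * sd, mu + 3.0 * sd)
    return summary

# Synthetic stand-in data: two tight groups of 20 samples with 6 attributes each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.uniform(-0.05, 0.05, size=(20, 6)),
    3.0 + rng.uniform(-0.05, 0.05, size=(20, 6)),
])
Z = rbf_kernel_pca(X, n_components=2, gamma=0.5)
labels = dbscan(Z, eps=0.5, min_pts=4)
summary = three_sigma_summary(X, labels)
```

Clustering runs on the reduced coordinates `Z`, while the cluster summaries are computed on the original attributes `X`, so each cluster center stays interpretable in the original units. In practice `gamma`, `eps`, and `min_pts` would need tuning for the data at hand.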