A new algorithm for clustering multiple data streams is proposed. The algorithm can effectively cluster data streams that show similar behavior but are separated by unknown time delays. It uses the autoregressive (AR) modeling technique to measure lag correlations between data streams, and exploits estimated frequency spectra to extract the essential features of each stream. Each stream is represented as a sum of spectral components, and the correlation between two streams is measured component-wise. Each spectral component is described by four parameters, namely amplitude, phase, damping rate, and frequency. The ε-lag-correlation between pairs of spectral components is calculated, and these correlations serve as the similarity measure for clustering the data streams. Based on a sliding window model, the algorithm continuously reports the most recent clustering results and dynamically adjusts the number of clusters. Experiments on real and synthetic streams show that the proposed method is faster and achieves better clustering quality than other similar methods.
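To illustrate what clustering under unknown time delays requires, the following minimal sketch finds the lag at which two streams are most strongly correlated. It is not the proposed method: it assumes a brute-force, time-domain Pearson correlation maximized over a bounded lag range, whereas the algorithm described above measures correlation component-wise on AR-estimated spectral components. The function names `pearson` and `best_lag_correlation` and the parameter `max_lag` are hypothetical and chosen only for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def best_lag_correlation(x, y, max_lag):
    """Return (lag, corr) maximizing corr(x[t], y[t + lag]) over |lag| <= max_lag.

    Assumption for this sketch: x and y have the same length and cover the
    same window. A positive lag means y trails x by that many samples.
    """
    n = len(x)
    best_lag, best_corr = 0, -1.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[:n - lag], y[lag:]
        else:
            a, b = x[-lag:], y[:n + lag]
        if len(a) < 2:
            continue  # too little overlap to compute a correlation
        r = pearson(a, b)
        if r > best_corr:
            best_lag, best_corr = lag, r
    return best_lag, best_corr


# Synthetic example: y is x delayed by 5 samples.
x = [math.sin(0.3 * t) for t in range(200)]
y = [math.sin(0.3 * (t - 5)) for t in range(200)]
lag, corr = best_lag_correlation(x, y, max_lag=10)
```

In a clustering setting, a value such as `1 - corr` could serve as a distance between two streams, but that choice is likewise an assumption of this sketch and not a detail taken from the abstract.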