Bayesian Nonparametric Unsupervised Concept Drift Detection for Data Stream Mining

Xuan Junyu; Lu Jie; Zhang Guangquan

Abstract

Online data stream mining is of great practical significance because data streams are ubiquitous in real-world scenarios, especially in the big data era. Traditional data mining algorithms cannot be applied directly to data streams because (1) the underlying data distribution may change over time (i.e., concept drift) and (2) labels for streaming data are, in practice, delayed, scarce, or missing. A new research area, unsupervised concept drift detection, has emerged to tackle this difficulty, relying mainly on two-sample hypothesis tests such as the Kolmogorov-Smirnov test. Surprisingly, none of the existing methods in this area exploits the Bayesian nonparametric hypothesis test, which offers clear interpretability and a straightforward way to encode prior knowledge, and which imposes no strict or unrealistic requirement to fix the form of the underlying data distribution in advance. In this article, we present a Bayesian nonparametric unsupervised concept drift detection method based on the Polya tree hypothesis test. The basic idea is to decompose the underlying data distribution into a multi-resolution representation, which turns the hypothesis test on the whole distribution into a recursion of simple binomial tests. In addition, an incremental mechanism is designed to improve efficiency in the stream setting. The method detects drifts effectively, locates where a drift occurs, and reports the posterior probabilities of the hypotheses. Experiments on synthetic data verify the desired properties of the proposed method, and experiments on real-world data show that it outperforms its frequentist counterpart from the literature in data stream mining.
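To make the idea of "recursive and simple binomial tests" concrete, the following is a minimal sketch of a generic Polya tree two-sample test, not the authors' implementation. It assumes one-dimensional data already scaled to [0, 1), a dyadic partition of fixed depth, and a symmetric Beta prior with parameter `c * level**2` at each level; the function names and the parameters `depth`, `c`, and `prior_h0` are hypothetical, and the incremental mechanism mentioned in the abstract is not reproduced.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def polya_tree_log_bf(x, y, depth=6, c=1.0):
    """Log Bayes factor of H0 (same distribution) against H1 (different).

    Assumes all values lie in [0, 1). Each node of the dyadic tree
    contributes one Beta-binomial comparison of left/right counts.
    """
    log_bf = 0.0
    # Each node: (lower bound, upper bound, points of x, points of y).
    nodes = [(0.0, 1.0, list(x), list(y))]
    for level in range(1, depth + 1):
        a = c * level ** 2  # assumed prior strength, growing with depth
        children = []
        for lo, hi, xs, ys in nodes:
            if not xs and not ys:
                continue  # an empty node contributes exactly zero
            mid = (lo + hi) / 2.0
            xl = [v for v in xs if v < mid]
            xr = [v for v in xs if v >= mid]
            yl = [v for v in ys if v < mid]
            yr = [v for v in ys if v >= mid]
            n1l, n1r, n2l, n2r = len(xl), len(xr), len(yl), len(yr)
            # H0: one shared branching probability; H1: one per sample.
            log_bf += (
                log_beta(a + n1l + n2l, a + n1r + n2r)
                + log_beta(a, a)
                - log_beta(a + n1l, a + n1r)
                - log_beta(a + n2l, a + n2r)
            )
            children.append((lo, mid, xl, yl))
            children.append((mid, hi, xr, yr))
        nodes = children
    return log_bf


def posterior_h0(log_bf, prior_h0=0.5):
    """Posterior probability of H0 (no drift), computed stably."""
    log_odds = log_bf + math.log(prior_h0) - math.log(1.0 - prior_h0)
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


if __name__ == "__main__":
    reference = [i / 40 for i in range(20)]        # values in [0, 0.5)
    shifted = [0.5 + i / 40 for i in range(20)]    # values in [0.5, 1)
    print(posterior_h0(polya_tree_log_bf(reference, reference)))
    print(posterior_h0(polya_tree_log_bf(reference, shifted)))
```

In a stream setting, `x` would play the role of a reference window and `y` a recent window; a small posterior for H0 would then be read as evidence of drift. Because the log Bayes factor is a sum over tree nodes, the nodes with the most negative terms indicate the regions of the data space where the two windows disagree most.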

