Categorizing and Mining Concept Drifting Data Streams

Abstract

Mining concept drifting data streams is a defining challenge for data mining research. Recent years have seen a large body of work on detecting changes and building prediction models from stream data, but with only a vague understanding of the types of concept drifting and of how each type affects mining algorithms. In this paper, we first categorize concept drifting into two scenarios, Loose Concept Drifting (LCD) and Rigorous Concept Drifting (RCD), and then propose solutions to handle each of them separately. For LCD data streams, because concepts in adjacent data chunks are sufficiently close to each other, we apply the kernel mean matching (KMM) method to minimize the discrepancy between the data chunks in the kernel space. This minimization produces weighted instances, which are used to build a classifier ensemble that handles concept drifting data streams. For RCD data streams, because genuine concepts in adjacent data chunks may change randomly and rapidly, we propose a new Optimal Weights Adjustment (OWA) method to determine the optimal weights for classifiers trained from the most recent (up-to-date) data chunk, so that those classifiers form an accurate classifier ensemble for predicting instances in the yet-to-come data chunk. Experiments on synthetic and real-world datasets show that the weighted-instance approach is preferable when concept drifting is mainly caused by changes in the class prior probability, whereas the weighted-classifier approach is preferable when concept drifting is mainly triggered by changes in the conditional probability.
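
To make the instance-weighting idea for LCD streams more concrete, the following is a minimal sketch of kernel mean matching between two adjacent chunks. It is not the authors' implementation: the abstract does not specify the kernel, the solver, or the constraints, so the RBF kernel, the projected-gradient solver, the final rescaling step (used here in place of the usual constraint on the sum of the weights), and the names `rbf` and `kmm_weights` are all assumptions made for illustration.

```python
import math

def rbf(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors (assumed kernel)."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def kmm_weights(old_chunk, new_chunk, gamma=1.0, bound=10.0, steps=500):
    """Approximate KMM weights that make old_chunk resemble new_chunk.

    Minimizes 0.5 * b'Kb - kappa'b subject to 0 <= b_i <= bound using
    projected gradient descent. Simplification: the constraint that the
    weights sum to roughly n is replaced by rescaling to mean 1 at the end.
    """
    n, m = len(old_chunk), len(new_chunk)
    # K[i][j]: similarity between instances of the older chunk.
    K = [[rbf(xi, xj, gamma) for xj in old_chunk] for xi in old_chunk]
    # kappa[i]: scaled similarity of old instance i to the newer chunk.
    kappa = [n / m * sum(rbf(xi, z, gamma) for z in new_chunk)
             for xi in old_chunk]
    # trace(K) bounds the largest eigenvalue of K, so 1/trace is a safe step.
    lr = 1.0 / sum(K[i][i] for i in range(n))
    beta = [1.0] * n
    for _ in range(steps):
        grad = [sum(K[i][j] * beta[j] for j in range(n)) - kappa[i]
                for i in range(n)]
        # Gradient step followed by projection onto the box [0, bound].
        beta = [min(bound, max(0.0, b - lr * g)) for b, g in zip(beta, grad)]
    mean = sum(beta) / n
    return [b / mean for b in beta] if mean > 0 else [1.0] * n

# Toy one-dimensional chunks: two old instances lie far from the new chunk,
# one lies close to it.
old = [[0.0], [0.2], [5.0]]
new = [[5.0], [5.1]]
weights = kmm_weights(old, new)
```

In this toy setup the old instance near the new chunk receives the largest weight. Such weights could then be passed to any learner that accepts per-instance weights when training the ensemble members. A production version would more likely hand the same quadratic program, with its full constraint set, to a dedicated QP solver.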
