Venue: ACM Conference on Information and Knowledge Management

SKIF: A Data Imputation Framework for Concept Drifting Data Streams

Abstract

Missing data occur in many applications. Many data imputation methods exist to handle missing data in databases, but they share several difficulties when applied to concept drifting data streams. First, because stream data are large and continuous, we cannot retain all stream records to form a candidate pool for missing value estimation, as most existing methods do. Second, even if we could maintain all complete stream records in a summary structure, concept drift would make some of that information obsolete and thus degrade imputation accuracy. Third, data streams require a fast yet accurate algorithm for finding the most similar data for imputation. Fourth, because data collection environments are dynamic and complex, the missing rate of most stream data may be much higher than that of databases, so the imputation method must be able to handle a high missing rate. To tackle these challenges, we propose a Streaming k-Nearest-Neighbors Imputation Framework (SKIF) for concept drifting data streams. To handle concept drift and large data volumes, SKIF first summarizes historical complete records into micro-resources (high-level statistical data structures) and maintains these micro-resources in a candidate pool as benchmark data. SKIF then employs a novel hybrid-kNN imputation procedure, which uses a hybrid similarity search mechanism to find the most similar micro-resources in the large-scale candidate pool efficiently. Experimental results demonstrate the effectiveness of the proposed SKIF framework for data stream imputation tasks.
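The abstract does not give implementation details, so the following is only a minimal sketch of the general idea: keep a bounded pool of statistical summaries of complete records, and impute a missing value from the k summaries nearest to the incomplete record. Everything here is an assumption made for illustration, not the authors' method: the names `Summary` and `SummaryPool`, the `radius` and `max_size` parameters, the count-plus-sums summary layout, and the rule of evicting the least recently updated summary as a crude stand-in for concept drift handling. A plain linear scan is used in place of the hybrid similarity search, which the abstract names but does not describe.

```python
import math
from dataclasses import dataclass


@dataclass
class Summary:
    """Hypothetical summary of several complete records."""
    count: int        # number of records absorbed
    sums: list        # per-feature sum of absorbed records
    last_seen: int    # logical time of the last update

    def centroid(self):
        return [s / self.count for s in self.sums]


class SummaryPool:
    """Bounded candidate pool of summaries (illustrative only)."""

    def __init__(self, max_size=100, radius=1.0):
        self.max_size = max_size  # assumed cap on pool size
        self.radius = radius      # assumed merge threshold
        self.items = []
        self.clock = 0

    def add_complete(self, record):
        """Absorb a complete record into the nearest summary or start a new one."""
        self.clock += 1
        nearest = min(self.items,
                      key=lambda s: math.dist(s.centroid(), record),
                      default=None)
        if nearest is not None and math.dist(nearest.centroid(), record) <= self.radius:
            nearest.count += 1
            nearest.sums = [a + b for a, b in zip(nearest.sums, record)]
            nearest.last_seen = self.clock
            return
        if len(self.items) >= self.max_size:
            # Assumed drift handling: drop the least recently updated summary.
            self.items.remove(min(self.items, key=lambda s: s.last_seen))
        self.items.append(Summary(1, list(record), self.clock))

    def impute(self, record, k=3):
        """Fill None entries with the mean of the k nearest centroids."""
        if not self.items:
            return list(record)
        observed = [i for i, v in enumerate(record) if v is not None]

        def partial_dist(s):
            # Distance computed on the observed features only.
            c = s.centroid()
            return math.sqrt(sum((c[i] - record[i]) ** 2 for i in observed))

        neighbours = sorted(self.items, key=partial_dist)[:k]
        filled = list(record)
        for i, v in enumerate(record):
            if v is None:
                filled[i] = sum(s.centroid()[i] for s in neighbours) / len(neighbours)
        return filled


pool = SummaryPool(max_size=2, radius=0.5)
for rec in [(1.0, 10.0), (1.2, 10.4), (5.0, 50.0)]:
    pool.add_complete(rec)
filled = pool.impute((1.1, None), k=1)  # second feature taken from the nearest summary
```

In this toy run the first two records merge into one summary, so the missing second feature is filled from that summary's centroid. Because the pool is capped at two summaries, a later record far from both would evict the older one, which illustrates how outdated information can be discarded in a bounded pool.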