Venue: IEEE International Conference on Data Engineering

Scalable distance-based outlier detection over high-volume data streams


Abstract

The discovery of distance-based outliers in huge volumes of streaming data is critical for modern applications ranging from credit card fraud detection to moving object monitoring. In this work, we propose the first general framework to handle the three major classes of distance-based outliers in streaming environments, including the traditional distance-threshold-based and the nearest-neighbor-based definitions. Our LEAP framework encompasses two general optimization principles applicable across all three outlier types. First, our “minimal probing” principle uses a lightweight probing operation to gather minimal yet sufficient evidence for outlier detection. This principle overturns the state-of-the-art methodology, which requires routinely conducting expensive complete neighborhood searches to identify outliers. Second, our “lifespan-aware prioritization” principle leverages the temporal relationships among stream data points to prioritize the order in which they are processed during probing. Guided by these two principles, we design an outlier detection strategy that is proven to be optimal in the CPU cost needed to determine the outlier status of any data point during its entire life. Our comprehensive experimental studies, using both synthetic and real streaming data, demonstrate that our methods are three orders of magnitude faster than state-of-the-art methods across a rich diversity of tested scenarios, and that they scale to high-dimensional streaming data.
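
The two principles can be illustrated with a minimal Python sketch. It is not the LEAP implementation. It assumes the common distance-threshold definition, under which a point is an outlier when fewer than k other points in the current window lie within distance r of it. The function name `probe`, its parameters, and the list-based window (oldest point first, newest last) are all hypothetical choices made for illustration.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def probe(i, window, r, k):
    """Decide whether window[i] is a distance-threshold outlier.

    Assumed definition: a point is an outlier if fewer than k other
    points in the window lie within distance r of it.

    Returns (is_outlier, probes_made).
    """
    found = 0
    probes = 0
    # Prioritization (illustrative): scan from the newest point to the
    # oldest, since newer points stay in the window longer and so give
    # longer-lasting evidence.
    for j in range(len(window) - 1, -1, -1):
        if j == i:
            continue
        probes += 1
        if dist(window[i], window[j]) <= r:
            found += 1
            # Minimal probing (illustrative): stop as soon as k
            # neighbors are known instead of scanning the whole window.
            if found >= k:
                return False, probes
    return True, probes


# Hypothetical window of 2-D points, ordered oldest to newest.
window = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (0.0, 0.1), (0.1, 0.1)]

print(probe(0, window, r=0.5, k=2))  # inlier, settled after 2 probes
print(probe(2, window, r=0.5, k=2))  # outlier, needs all 4 probes
```

In this toy window the inlier is settled after two distance computations, while the outlier still needs a full scan. The early exit therefore pays off most when the bulk of the points are inliers.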
