Online detection of outliers for data streams.


Abstract

In applications such as Web click analysis and environmental monitoring, data arrive as streams. Each stream is an infinite sequence of data points with explicit or implicit timestamps, and has special characteristics such as transiency, uncertainty, dynamic data distribution, multi-dimensionality, asynchronous data arrival, dynamic relationships, and schema heterogeneity of data from different sources. In those applications, outliers occur for many reasons, including human error, instrument error, catastrophe, and malicious behavior. Being able to detect outliers effectively is critical to many data management and mining tasks. However, little research has been conducted on discovering outliers in data stream applications, especially those involving multi-dimensional, related, heterogeneous, and asynchronous streams.

This dissertation presents two innovative outlier detection algorithms, Orion and Wadjet, which take all of these data stream characteristics into consideration. Orion is designed for applications where data come from a single stream. With the help of an evolutionary algorithm, it looks for a projected dimension that reveals the outlier nature of multi-dimensional data points, and it identifies a data point as an outlier if the point resides in a low-density region in that dimension. Wadjet is designed for applications where data come from multiple, heterogeneous, and asynchronous streams. It has two phases. In the first phase, it processes each stream independently, as Orion does. In the second phase, it captures and continuously evaluates the cross-correlation, if any, among the data points from multiple streams, and identifies a data point as an outlier if its value does not conform to the captured cross-correlation.

Extensive theoretical and empirical analyses have been conducted to evaluate the performance of Orion and Wadjet using real and synthetic datasets. The evaluation results show that both algorithms achieve better accuracy and execution time than state-of-the-art techniques when applied to homogeneous data stream applications. The results also show that Wadjet is effective in detecting outliers in heterogeneous data streams, which existing algorithms cannot handle.
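The two ideas described in the abstract can be illustrated with a minimal Python sketch. This is not the dissertation's Orion or Wadjet implementation: the class names, the sliding window, the Gaussian kernel, the thresholds, and the linear fit are all assumptions chosen for illustration. The projection direction is supplied by the caller rather than found by an evolutionary search, and the two streams are assumed to be already time-aligned.

```python
import math
from collections import deque


def project(point, direction):
    """Project a multi-dimensional point onto a direction (normalized here)."""
    length = math.sqrt(sum(d * d for d in direction))
    return sum(p * d for p, d in zip(point, direction)) / length


def kernel_density(value, window, bandwidth):
    """Gaussian kernel density estimate at `value` from the values in `window`."""
    if not window:
        return 0.0
    norm = 1.0 / (len(window) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((value - w) / bandwidth) ** 2) for w in window)


class ProjectedDensityDetector:
    """Hypothetical single-stream check: low density in a projected dimension."""

    def __init__(self, direction, window_size=200, bandwidth=0.5,
                 threshold=0.01, warmup=30):
        self.direction = direction            # assumed given, not searched for
        self.window = deque(maxlen=window_size)
        self.bandwidth = bandwidth
        self.threshold = threshold
        self.warmup = warmup

    def update(self, point):
        value = project(point, self.direction)
        density = kernel_density(value, self.window, self.bandwidth)
        is_outlier = len(self.window) >= self.warmup and density < self.threshold
        self.window.append(value)             # simplification: outliers are kept too
        return is_outlier


class CrossStreamDetector:
    """Hypothetical two-stream check: deviation from a fitted linear relation."""

    def __init__(self, window_size=200, min_corr=0.8, tolerance=3.0, warmup=30):
        self.pairs = deque(maxlen=window_size)
        self.min_corr = min_corr
        self.tolerance = tolerance
        self.warmup = warmup

    def _fit(self):
        n = len(self.pairs)
        mean_a = sum(a for a, _ in self.pairs) / n
        mean_b = sum(b for _, b in self.pairs) / n
        var_a = sum((a - mean_a) ** 2 for a, _ in self.pairs)
        var_b = sum((b - mean_b) ** 2 for _, b in self.pairs)
        cov = sum((a - mean_a) * (b - mean_b) for a, b in self.pairs)
        if var_a == 0 or var_b == 0:
            return None                        # correlation undefined
        slope = cov / var_a
        intercept = mean_b - slope * mean_a
        corr = cov / math.sqrt(var_a * var_b)
        resid_std = math.sqrt(
            sum((b - (slope * a + intercept)) ** 2 for a, b in self.pairs) / n)
        return slope, intercept, corr, resid_std

    def update(self, a, b):
        is_outlier = False
        fit = self._fit() if len(self.pairs) >= self.warmup else None
        if fit is not None:
            slope, intercept, corr, resid_std = fit
            # Only judge b when a strong correlation has been captured.
            if abs(corr) >= self.min_corr and resid_std > 0:
                residual = abs(b - (slope * a + intercept))
                is_outlier = residual > self.tolerance * resid_std
        self.pairs.append((a, b))
        return is_outlier
```

Each `update` call scores the incoming point against the current window before adding it, so a detector can be driven one point at a time as data arrive. Recomputing the density and the fit from scratch on every point keeps the sketch short; a streaming implementation would maintain these statistics incrementally.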

Bibliographic record

  • Author

    Sadik, Md. Shiblee

  • Author affiliation

    The University of Oklahoma

  • Degree-granting institution: The University of Oklahoma
  • Subject: Computer Science
  • Degree: Ph.D.
  • Year: 2013
  • Pages: 289 p.
  • Total pages: 289
  • Original format: PDF
  • Language: eng