首页> 外文会议>ACM SIGMOD international conference on management of data >Fast Approximate Correlation for Massive Time-series Data
【24h】

Fast Approximate Correlation for Massive Time-series Data

机译:大规模时序数据的快速近似相关性

获取原文

摘要

We consider the problem of computing all-pair correlations in a warehouse containing a large number (e.g., tens of thousands) of time-series (or, signals). The problem arises in automatic discovery of patterns and anomalies in data intensive applications such as data center management, environmental monitoring, and scientific experiments. However, with existing techniques, solving the problem for a large stream warehouse is extremely expensive, due to the problem's inherent quadratic I/O and CPU complexities. We propose novel algorithms, based on Discrete Fourier Transformation (DFT) and graph partitioning, to reduce the end-to-end response time of an all-pair correlation query. To minimize I/O cost, we partition a massive set of input signals into smaller batches such that caching the signals one batch at a time maximizes data reuse and minimizes disk I/O. To reduce CPU cost, we propose two approximation algorithms. Our first algorithm efficiently computes approximate correlation coefficients of similar signal pairs within a given error bound. The second algorithm efficiently identifies, without any false positives or negatives, all signal pairs with correlations above a given threshold. For many real applications, our approximate solutions are as useful as corresponding exact solutions, due to our strict error guarantees. However, compared to the state-of-the-art exact algorithms, our algorithms are up to 17x faster for several real datasets.
机译:我们考虑在包含大量(例如,成千上万)的时序(或信号)的仓库中计算全对相关性的问题。在数据密集型应用中的自动发现模式和异常时出现问题,如数据中心管理,环境监测和科学实验。然而,通过现有技术,由于问题的固有的二次I / O和CPU复杂性,解决大型流仓库的问题非常昂贵。我们提出了基于离散傅里叶变换(DFT)和曲线图分区的小说算法,以减少全对相关查询的端到端响应时间。为了最小化I / O成本,我们将大量输入信号分配成较小的批量,使得一次缓存信号一批最大化数据重用并最小化磁盘I / O。为了降低CPU成本,我们提出了两个近似算法。我们的第一算法有效地计算给定错误绑定内的类似信号对的近似相关系数。第二算法有效地识别,而无需任何误报或否定,所有信号对都具有高于给定阈值的相关性。对于许多真实应用,由于我们严格的错误保证,我们的近似解决方案与相应的精确解决方案一样有用。然而,与最先进的精确算法相比,对于几个真实数据集,我们的算法速度高达17倍。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号