Fast Approximate Correlation for Massive Time-series Data

Abstract

We consider the problem of computing all-pair correlations in a warehouse containing a large number (e.g., tens of thousands) of time-series (or, signals). The problem arises in automatic discovery of patterns and anomalies in data intensive applications such as data center management, environmental monitoring, and scientific experiments. However, with existing techniques, solving the problem for a large stream warehouse is extremely expensive, due to the problem's inherent quadratic I/O and CPU complexities. We propose novel algorithms, based on Discrete Fourier Transformation (DFT) and graph partitioning, to reduce the end-to-end response time of an all-pair correlation query. To minimize I/O cost, we partition a massive set of input signals into smaller batches such that caching the signals one batch at a time maximizes data reuse and minimizes disk I/O. To reduce CPU cost, we propose two approximation algorithms. Our first algorithm efficiently computes approximate correlation coefficients of similar signal pairs within a given error bound. The second algorithm efficiently identifies, without any false positives or negatives, all signal pairs with correlations above a given threshold. For many real applications, our approximate solutions are as useful as corresponding exact solutions, due to our strict error guarantees. However, compared to the state-of-the-art exact algorithms, our algorithms are up to 17x faster for several real datasets.
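
The abstract does not spell out the algorithms, so the following is only a minimal sketch of the classical DFT idea that this line of work builds on, not the authors' method. It assumes non-constant signals of equal length. All function names (`unit_normalize`, `dft_sketch`, `correlation_upper_bound`, `candidate_pairs`) are hypothetical. For signals centered and scaled to unit norm, the correlation equals 1 - d^2/2, where d is their Euclidean distance. By Parseval's theorem, a distance computed from only the first k DFT coefficients never exceeds d, so it gives an upper bound on the correlation that can prune pairs below a threshold without discarding any true match.

```python
import numpy as np

def unit_normalize(x):
    """Center a signal and scale it to unit Euclidean norm (assumes it is not constant)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return x / np.linalg.norm(x)

def exact_correlation(x, y):
    """Pearson correlation as the dot product of the normalized signals."""
    return float(np.dot(unit_normalize(x), unit_normalize(y)))

def dft_sketch(x, k):
    """Coefficients 1..k of the orthonormal real DFT of the normalized signal.

    Coefficient 0 is skipped because it is zero after centering. k is capped
    at (n - 1) // 2 so every kept coefficient has a conjugate twin and counts
    twice in Parseval's sum.
    """
    n = len(x)
    if not 1 <= k <= (n - 1) // 2:
        raise ValueError("k must satisfy 1 <= k <= (n - 1) // 2")
    return np.fft.rfft(unit_normalize(x), norm="ortho")[1:k + 1]

def correlation_upper_bound(sx, sy):
    """Upper bound on the correlation from two sketches of equal length.

    d^2 >= 2 * sum |X_j - Y_j|^2 over the kept coefficients, hence
    corr = 1 - d^2 / 2 <= 1 - sum |X_j - Y_j|^2.
    """
    return 1.0 - float(np.sum(np.abs(sx - sy) ** 2))

def candidate_pairs(signals, k, threshold):
    """Pairs whose upper bound reaches the threshold; the rest are safely pruned."""
    sketches = [dft_sketch(s, k) for s in signals]
    pairs = []
    for i in range(len(sketches)):
        for j in range(i + 1, len(sketches)):
            if correlation_upper_bound(sketches[i], sketches[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

The surviving candidates still need an exact check with `exact_correlation` to remove false positives. When most of a signal's energy sits in its low frequencies, the bound is tight and a small k is enough, which is where the CPU savings would come from in a scheme of this kind.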

Bibliographic details

Source: ACM SIGMOD international conference on management of data, 2010, 12 pages
Authors: Abdullah Mueen, Suman Nath, Jie Liu
Original format: PDF
Chinese Library Classification: Programming; software engineering
Keywords: correlation matrix; discrete Fourier transform

Similar literature

1. Fast correlation coefficient estimation algorithm for HBase-based massive time series data [J]. Liu Wen, Zhang Tuqian, Shen Yanming. Frontiers of computer science in China, 2019, No. 4.
2. Fast correlation coefficient estimation algorithm for HBase-based massive time series data [J]. Liu Wen, Zhang Tuqian, Shen Yanming. Frontiers of computer science, 2019, No. 4.
3. Approximate Clustering of Time-Series Datasets using k-Modes Partitioning [J]. Saeed Aghabozorgi, Teh Ying Wah. Journal of information science and engineering, 2015, No. 1.
4. Fast Approximate Correlation for Massive Time-series Data [C]. Abdullah Mueen, Suman Nath, Jie Liu. ACM SIGMOD international conference on management of data; SIGMOD 2010, 2010.
5. Approximate Search on Massive Spatiotemporal Datasets [D]. Brugere, Ivan. 2012.
6. Fast randomized approximate string matching with succinct hash data structures [O]. Alberto Policriti, Nicola Prezza. 2015.
7. Fast Approximate Correlation for Massive Time-series Data [O]. Abdullah Mueen, Suman Nath, Jie Liu. 2010.
8. Faster Parallel Algorithm and Efficient Multithreaded Implementations for Evaluating Betweenness Centrality on Massive Datasets [R]. Madduri, K., Ediger, D., Jiang, K. 2008.
