Journal: ACM Transactions on Knowledge Discovery from Data

Addressing Big Data Time Series: Mining Trillions of Time Series Subsequences Under Dynamic Time Warping

Abstract

Most time series data mining algorithms use similarity search as a core subroutine, and thus the time taken for similarity search is the bottleneck for virtually all time series data mining algorithms, including classification, clustering, motif discovery, anomaly detection, and so on. The difficulty of scaling a search to large datasets explains to a great extent why most academic work on time series data mining has plateaued at considering a few millions of time series objects, while much of industry and science sits on billions of time series objects waiting to be explored. In this work we show that by using a combination of four novel ideas we can search and mine massive time series for the first time. We demonstrate the following unintuitive fact: in large datasets we can exactly search under Dynamic Time Warping (DTW) much more quickly than the current state-of-the-art Euclidean distance search algorithms. We demonstrate our work on the largest set of time series experiments ever attempted. In particular, the largest dataset we consider is larger than the combined size of all of the time series datasets considered in all data mining papers ever published. We explain how our ideas allow us to solve higher-level time series data mining problems such as motif discovery and clustering at scales that would otherwise be untenable. Moreover, we show how our ideas allow us to efficiently support the uniform scaling distance measure, a measure whose utility seems to be underappreciated, but which we demonstrate here. In addition to mining massive datasets with up to one trillion datapoints, we will show that our ideas also have implications for real-time monitoring of data streams, allowing us to handle much faster arrival rates and/or use cheaper and lower powered devices than are currently possible.
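
For readers unfamiliar with the distance measure named in the abstract, the following is a minimal sketch of textbook Dynamic Time Warping with an optional warping-window constraint, plus a brute-force subsequence search built on it. It is not the authors' method: the abstract does not describe the four ideas, and none of them are reproduced here. The function names `dtw_distance` and `best_match` and the `window` parameter are illustrative assumptions.

```python
import math


def dtw_distance(a, b, window=None):
    """Textbook DTW between two numeric sequences (illustrative sketch).

    `window` is a hypothetical parameter limiting how far the warping
    path may stray from the diagonal; None means unconstrained.
    """
    n, m = len(a), len(b)
    # The band must be at least |n - m| wide, or no warping path exists.
    w = max(n, m) if window is None else max(window, abs(n - m))
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix costs nothing
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three admissible predecessor cells.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return math.sqrt(prev[m])


def best_match(query, series, window=None):
    """Brute-force scan: compare the query against every subsequence
    of the same length and return (start_index, distance) of the best."""
    best_start, best_dist = -1, math.inf
    for start in range(len(series) - len(query) + 1):
        candidate = series[start:start + len(query)]
        dist = dtw_distance(query, candidate, window)
        if dist < best_dist:
            best_start, best_dist = start, dist
    return best_start, best_dist
```

With `window=0` the warping path is forced onto the diagonal, so for equal-length sequences the result reduces to the Euclidean distance. The brute-force scan costs one full DTW computation per candidate subsequence, which is the kind of expense that makes similarity search the bottleneck the abstract describes.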
