Zips: Mining Compressing Sequential Patterns in Streams

Hoang Thanh Lam; Toon Calders; Jie Yang; Fabian Morchen; Dmitriy Fradkin

首页> 外文期刊>SIGKDD explorations >Zips: Mining Compressing Sequential Patterns in Streams

【24h】

Zips: Mining Compressing Sequential Patterns in Streams

机译：压缩：挖掘流中的压缩顺序模式

获取原文

获取原文并翻译 | 示例

获取外文期刊封面封底 >>

开具论文收录证明 >>

文献代查 >>

页面导航

摘要
著录项
相似文献
相关主题

摘要

We propose a streaming algorithm, based on the minimal description length (MDL) principle, for extracting non-redundant sequential patterns. For static databases, the MDL-based approach that selects patterns based on their capacity to compress data rather than their frequency, was shown to be remarkably effective for extracting meaningful patterns and solving the redundancy issue in frequent itemset and sequence mining. The existing MDL-based algorithms, however, either start from a seed set of frequent patterns, or require multiple passes through the data. As such, the existing approaches scale poorly and are unsuitable for large datasets. Therefore, our main contribution is the proposal of a new, streaming algorithm, called Zips, that does not require a seed set of patterns and requires only one scan over the data. For Zips, we extended the Lempel-Ziv (LZ) compression algorithm in three ways: first, whereas LZ assigns codes uniformly as it builds up its dictionary while scanning the input, Zips assigns codewords according to the usage of the dictionary words; more heavily used words get shorter code-lengths. Secondly, Zips exploits also non-consecutive occurrences of dictionary words for compression. And, third, the well-known space-saving algorithm is used to evict unpromising words from the dictionary. Experiments on one synthetic and two real-world large-scale datasets show that our approach extracts meaningful compressing patterns with similar quality to the state-of-the-art multi-pass algorithms proposed for static databases of sequences. Moreover, our approach scales linearly with the size of data streams while all the existing algorithms do not.

机译：我们提出了一种基于最小描述长度（MDL）原理的流算法，用于提取非冗余顺序模式。对于静态数据库，基于MDL的方法基于模式压缩数据的能力而不是频率来选择模式，这对于提取有意义的模式以及解决频繁项集和序列挖掘中的冗余问题非常有效。但是，现有的基于MDL的算法要么从一组频繁模式的种子开始，要么需要多次遍历数据。因此，现有方法的伸缩性很差，因此不适用于大型数据集。因此，我们的主要贡献是提出了一种称为Zips的新型流算法，该算法不需要种子模式集，并且只需要对数据进行一次扫描即可。对于Zips，我们通过以下三种方式扩展了Lempel-Ziv（LZ）压缩算法：首先，LZ在扫描输入时建立字典时会统一分配代码，Zips根据字典词的用法分配代码字。使用频率更高的字词的代码长度越短。其次，Zips还利用字典单词的非连续出现进行压缩。第三，众所周知的节省空间的算法用于从字典中逐出不希望的单词。在一个合成的和两个真实的大型数据集上进行的实验表明，我们的方法提取的有意义的压缩模式质量与为静态序列数据库提出的最新的多遍算法相似。此外，我们的方法与数据流的大小呈线性比例关系，而所有现有算法均没有。

著录项

来源
《SIGKDD explorations》 |2013年第cdarom期|共9页
作者
Hoang Thanh Lam; Toon Calders; Jie Yang; Fabian Morchen; Dmitriy Fradkin;
展开▼
作者单位

展开▼
收录信息
原文格式 PDF
正文语种 eng
中图分类 TP274.2;
关键词

相似文献

外文文献
中文文献
专利

1. Zips: Mining Compressing Sequential Patterns in Streams [J] . Hoang Thanh Lam, Toon Calders, Jie Yang, SIGKDD explorations . 2013,第CDaROM期

机译：压缩：挖掘流中的压缩顺序模式
2. Mining Compressing Sequential Patterns [J] . Lam Hoang Thanh, Moerchen Fabian, Fradkin Dmitriy, Statistical Analysis and Data Mining . 2014,第1期

机译：采矿压缩顺序图案
3. Mining Compressing Sequential Patterns [J] . Hoang Thanh Lam, Fabian Mrchen, Dmitriy Fradkin, Statistical Analysis and Data Mining . 2013,第1期

机译：挖掘压缩顺序模式
4. Incremental Mining of Across-streams Sequential Patterns in Multiple Data Streams [C] . Ching-Ming Chao, Yan-Ting Lin 25th international conference on computers and their applications 2010 . 2010

机译：多个数据流中跨流顺序模式的增量挖掘
5. Mining frequent sequential patterns in data streams using SSM-algorithm. [D] . Monwar, Mostafa. 2005

机译：使用SSM算法在数据流中挖掘频繁的顺序模式。
6. Applying sequential pattern mining to investigate cerebrovascular health outpatients’ re-visit patterns [O] . Chao Ou-Yang, Chandrawati Putri Wulandari, Rizka Aisha Rahmi Hariadi, -1

机译：应用顺序模式挖掘来调查脑血管健康门诊病人的重访模式
7. Sequential Pattern Mining for Uncertain Data Streams using Sequential Sketch [O] . Jingyu Chen, Ping Chen 2015

机译：基于序列草图的不确定数据流序列模式挖掘

Zips: Mining Compressing Sequential Patterns in Streams

摘要

著录项

相似文献

相关主题

期刊订阅