...
首页> 外文期刊>SIGKDD explorations >Zips: Mining Compressing Sequential Patterns in Streams
【24h】

Zips: Mining Compressing Sequential Patterns in Streams

机译:压缩:挖掘流中的压缩顺序模式

获取原文
获取原文并翻译 | 示例
   

获取外文期刊封面封底 >>

       

摘要

We propose a streaming algorithm, based on the minimal description length (MDL) principle, for extracting non-redundant sequential patterns. For static databases, the MDL-based approach that selects patterns based on their capacity to compress data rather than their frequency, was shown to be remarkably effective for extracting meaningful patterns and solving the redundancy issue in frequent itemset and sequence mining. The existing MDL-based algorithms, however, either start from a seed set of frequent patterns, or require multiple passes through the data. As such, the existing approaches scale poorly and are unsuitable for large datasets. Therefore, our main contribution is the proposal of a new, streaming algorithm, called Zips, that does not require a seed set of patterns and requires only one scan over the data. For Zips, we extended the Lempel-Ziv (LZ) compression algorithm in three ways: first, whereas LZ assigns codes uniformly as it builds up its dictionary while scanning the input, Zips assigns codewords according to the usage of the dictionary words; more heavily used words get shorter code-lengths. Secondly, Zips exploits also non-consecutive occurrences of dictionary words for compression. And, third, the well-known space-saving algorithm is used to evict unpromising words from the dictionary. Experiments on one synthetic and two real-world large-scale datasets show that our approach extracts meaningful compressing patterns with similar quality to the state-of-the-art multi-pass algorithms proposed for static databases of sequences. Moreover, our approach scales linearly with the size of data streams while all the existing algorithms do not.
机译:我们提出了一种基于最小描述长度(MDL)原理的流算法,用于提取非冗余顺序模式。对于静态数据库,基于MDL的方法基于模式压缩数据的能力而不是频率来选择模式,这对于提取有意义的模式以及解决频繁项集和序列挖掘中的冗余问题非常有效。但是,现有的基于MDL的算法要么从一组频繁模式的种子开始,要么需要多次遍历数据。因此,现有方法的伸缩性很差,因此不适用于大型数据集。因此,我们的主要贡献是提出了一种称为Zips的新型流算法,该算法不需要种子模式集,并且只需要对数据进行一次扫描即可。对于Zips,我们通过以下三种方式扩展了Lempel-Ziv(LZ)压缩算法:首先,LZ在扫描输入时建立字典时会统一分配代码,Zips根据字典词的用法分配代码字。使用频率更高的字词的代码长度越短。其次,Zips还利用字典单词的非连续出现进行压缩。第三,众所周知的节省空间的算法用于从字典中逐出不希望的单词。在一个合成的和两个真实的大型数据集上进行的实验表明,我们的方法提取的有意义的压缩模式质量与为静态序列数据库提出的最新的多遍算法相似。此外,我们的方法与数据流的大小呈线性比例关系,而所有现有算法均没有。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号