European Conference on Principles of Data Mining and Knowledge Discovery

A Scalable Constant-Memory Sampling Algorithm for Pattern Discovery in Large Databases

Abstract

Many data mining tasks can be seen as an instance of the problem of finding the most interesting (according to some utility function) patterns in a large database. In recent years, significant progress has been achieved in scaling algorithms for this task to very large databases through the use of sequential sampling techniques. However, except for sampling-based greedy algorithms which cannot give absolute quality guarantees, the scalability of existing approaches to this problem is only with respect to the data, not with respect to the size of the pattern space: it is universally assumed that the entire hypothesis space fits in main memory. In this paper, we describe how this class of algorithms can be extended to hypothesis spaces that do not fit in memory while maintaining the algorithms' precise epsilon-delta quality guarantees. We present a constant memory algorithm for this task and prove that it possesses the required properties. In an empirical comparison, we compare variable memory and constant memory sampling.
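
To make the epsilon-delta guarantee concrete, the following is a minimal sketch of generic sequential sampling over a small hypothesis set that fits in memory, which is exactly the assumption the paper sets out to remove. It is not the constant-memory algorithm described in the abstract. The names `hoeffding_radius`, `sample_best_hypothesis`, `draw_record`, and `batch_size` are hypothetical, utilities are assumed to lie in [0, 1] and to be averages over records, and the confidence intervals use a standard Hoeffding bound with a union bound over hypotheses and stopping checks.

```python
import math
import random


def hoeffding_radius(n, delta):
    """Two-sided Hoeffding half-width for the mean of n samples in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def sample_best_hypothesis(hypotheses, draw_record, epsilon, delta, batch_size=200):
    """Sample until one hypothesis is within epsilon of the best, w.p. >= 1 - delta.

    hypotheses:  list of functions mapping a record to a utility in [0, 1]
                 (assumed to fit in memory; illustrative only).
    draw_record: function returning one record drawn uniformly at random.
    Returns (index of chosen hypothesis, estimated utilities, samples used).
    """
    k = len(hypotheses)
    sums = [0.0] * k
    n = 0
    check = 0
    while True:
        # Draw one batch and update every hypothesis's running utility sum.
        for _ in range(batch_size):
            record = draw_record()
            for i, h in enumerate(hypotheses):
                sums[i] += h(record)
        n += batch_size
        check += 1

        # Split delta over all hypotheses and all checks:
        # the sum over t >= 1 of 1 / (t * (t + 1)) equals 1.
        delta_t = delta / (k * check * (check + 1))
        radius = hoeffding_radius(n, delta_t)

        means = [s / n for s in sums]
        best = max(range(k), key=means.__getitem__)
        runner_up = max(
            (means[i] for i in range(k) if i != best), default=float("-inf")
        )

        # Stop once no rival's upper bound exceeds the leader's lower bound
        # by more than epsilon.
        if (runner_up + radius) - (means[best] - radius) <= epsilon:
            return best, means, n


if __name__ == "__main__":
    rng = random.Random(0)
    database = list(range(10_000))  # toy "database" of integer records
    patterns = [
        lambda x: float(x % 2 == 0),  # support about 0.5
        lambda x: float(x % 3 == 0),  # support about 0.33
        lambda x: float(x < 9_000),   # support 0.9
    ]
    best, means, n = sample_best_hypothesis(
        patterns, lambda: rng.choice(database), epsilon=0.05, delta=0.05
    )
    print(best, n, [round(m, 3) for m in means])
```

Memory use in this sketch grows linearly with the number of hypotheses, because one running sum is kept per hypothesis. That is the scalability limit with respect to the pattern space that the abstract refers to.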
