首页> 外文期刊>ACM transactions on knowledge discovery from data >Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees
【24h】

Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees

机译:通过严格的性能保证抽样,有效发现关联规则和常用项目集

获取原文
获取原文并翻译 | 示例

摘要

The tasks of extracting (top-K) Frequent Itemsets (FIs) and Association Rules (ARs) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need of scanning the entire dataset, possibly multiple times. High-quality approximations of FIs and ARs are sufficient for most practical uses. Sampling techniques can be used for fast discovery of approximate solutions, but works exploring this technique did not provide satisfactory performance guarantees on the quality of the approximation due to the difficulty of bounding the probability of under- or oversampling any one of an unknown number of frequent itemsets. We circumvent this issue by applying the statistical concept of Vapnik-Chervonenkis (VC) dimension to develop a novel technique for providing tight bounds on the sample size that guarantees approximation of the (top-K) FIs and ARs within user-specified parameters. The resulting sample size is linearly dependent on the VC-dimension of a range space associated with the dataset. We analyze the VC-dimension of this range space and show that it is upper bounded by an easy-to-compute characteristic quantity of the dataset, the d-index, namely, the maximum integer d such that the dataset contains at least d transactions of length at least d such that no one of them is a superset of or equal to another. We show that this bound is tight for a large class of datasets. The resulting sample size is a significant improvement over previous known results. We present an extensive experimental evaluation of our technique on real and artificial datasets, demonstrating the practicality of our methods, and showing that they achieve even higher quality approximations than what is guaranteed by the analysis.
机译:提取(前K个)频繁项集(FI)和关​​联规则(AR)的任务是数据挖掘和数据库应用程序中的基本原语。存在解决这些问题的精确算法并被广泛使用,但是由于需要扫描整个数据集(可能需要多次扫描)而阻碍了它们的运行时间。 FI和AR的高质量近似值足以满足大多数实际用途。采样技术可用于快速发现近似解,但是探索该技术的工作由于难以限制对未知数频繁项中的任何一个进行欠采样或过采样的概率的限制而无法提供令人满意的近似质量性能保证。项目集。我们通过应用Vapnik-Chervonenkis(VC)维度的统计概念来规避此问题,从而开发出一种新颖的技术来为样本量提供严格的界限,从而保证在用户指定的参数内逼近(top-K)FI和AR。所得样本大小线性依赖于与数据集关联的范围空间的VC维。我们分析了该范围空间的VC维,并表明它受数据集易于计算的特征量d-index(即最大整数d)的上限,从而使数据集至少包含d个交易长度至少为d,以使它们中没有一个是另一个或等于另一个的超集。我们表明,对于一大类数据集,该界限是紧密的。所得的样本量比以前的已知结果有了显着改善。我们在真实和人工数据集上对我们的技术进行了广泛的实验评估,证明了我们方法的实用性,并表明与分析所保证的方法相比,它们可以获得更高的质量近似值。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号