International Conference on Very Large Data Bases

NetCube: A Scalable Tool for Fast Data Mining and Compression

Abstract

We propose a novel method of computing and storing DataCubes. Our idea is to use Bayesian networks, which can generate approximate counts for any query combination of attribute values and "don't cares." A Bayesian network represents the underlying joint probability distribution of the data that were used to generate it. By means of such a network, the proposed method, NetCube, exploits correlations among attributes. Our proposed preprocessing algorithm scales linearly with the size of the database and is parallelizable in a straightforward way. Moreover, we give an algorithm to estimate counts of arbitrary queries that is fast (constant in the database size). Experimental results show that NetCubes are fast to generate and use (a few minutes of preprocessing time per 100,000 records and less than a second of query time), achieve excellent compression (compression ratios of at least 1800:1 on real data), and have low reconstruction error (less than 5% on average). Moreover, our method naturally allows for visualization and data mining at no extra cost.
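
The idea in the abstract of answering count queries from a Bayesian network, rather than from the stored data, can be illustrated with a minimal sketch. This is not the NetCube algorithm itself: the three binary attributes `A`, `B`, `C`, the network structure (A -> B, A -> C), the probability tables, and the record count are all hypothetical values chosen for illustration, and the marginalization is done by brute-force enumeration, which only works for toy networks.

```python
from itertools import product

# Hypothetical toy network over binary attributes: A -> B and A -> C.
# All numbers below are illustrative assumptions, not values from the paper.
N_RECORDS = 100_000  # assumed number of records the network was learned from

P_A = {0: 0.6, 1: 0.4}                                       # P(A = a)
P_B_GIVEN_A = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}     # P(B = b | A = a)
P_C_GIVEN_A = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.5, 1: 0.5}}     # P(C = c | A = a)


def joint(a, b, c):
    """Joint probability factorized along the network structure."""
    return P_A[a] * P_B_GIVEN_A[a][b] * P_C_GIVEN_A[a][c]


def estimate_count(query, n_records=N_RECORDS):
    """Approximate the count of records matching `query`.

    `query` maps attribute names to a required value; an attribute that is
    missing or mapped to None is treated as a "don't care" and summed out.
    """
    total = 0.0
    for a, b, c in product((0, 1), repeat=3):
        assignment = {"A": a, "B": b, "C": c}
        if all(v is None or assignment[k] == v for k, v in query.items()):
            total += joint(a, b, c)
    return n_records * total


if __name__ == "__main__":
    # Count of records with A = 1 and B = 1, with C as a "don't care".
    print(estimate_count({"A": 1, "B": 1, "C": None}))
```

With these assumed tables, the query `A = 1, B = 1` with `C` left open yields approximately 32,000 records, since P(A=1) * P(B=1 | A=1) = 0.4 * 0.8 = 0.32. The estimate touches only the probability tables, never the records, which is why query time in such a scheme does not grow with the database size.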
