Journal: Knowledge and Information Systems

Data placement in massively distributed environments for fast parallel mining of frequent itemsets



Abstract

Frequent itemset mining is one of the fundamental building blocks of data mining. However, despite the significant recent advances in the data mining literature, few solutions, whether standard or improved, scale well. This is particularly the case when (1) the quantity of data is very large and/or (2) the minimum support is very low. In this paper, we address the problem of parallel frequent itemset mining (PFIM) in very large databases and study the impact and effectiveness of specific data placement strategies in a massively distributed environment. By offering a clever data placement and an optimal organization of the extraction algorithms, we show that the arrangement of both the data and the processes can make the overall job either completely inoperative or very effective. In this setting, we propose two highly scalable PFIM algorithms, namely P2S (parallel-2-steps) and PATD (parallel absolute top-down). P2S discovers itemsets from large databases in two simple yet efficient parallel jobs, while PATD makes the mining of very large databases simpler and more compact. The PATD mining process consists of a single parallel job, which dramatically reduces the running time, the communication cost and the energy consumption overhead on a distributed computing platform. The proposed approaches have been extensively evaluated on massive real-world data sets. The experimental results confirm the effectiveness and scalability of our proposals through the substantial scale-up obtained at very low minimum supports compared with other alternatives.
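
To make the setting concrete, the following is a minimal sketch of frequent itemset mining over partitioned data in two jobs. The abstract does not describe how P2S or PATD work internally, so this is not the authors' algorithm: it is the classic partition-based two-pass scheme (often called SON), and the function names `local_frequent`, `count_candidates` and `two_job_mining`, as well as the `min_ratio` parameter (minimum support as a fraction of transactions), are assumptions made for illustration. Each inner list stands in for one data split on one worker, and the brute-force subset enumeration is only reasonable for tiny transactions.

```python
from collections import Counter
from itertools import combinations


def local_frequent(partition, min_ratio):
    """Job 1 (per split): itemsets that are frequent within this split alone."""
    threshold = min_ratio * len(partition)
    counts = Counter()
    for transaction in partition:
        items = sorted(set(transaction))
        # Brute-force enumeration of all non-empty subsets (toy sizes only).
        for size in range(1, len(items) + 1):
            counts.update(combinations(items, size))
    return {itemset for itemset, n in counts.items() if n >= threshold}


def count_candidates(partition, candidates):
    """Job 2 (per split): exact occurrence counts of every global candidate."""
    counts = Counter()
    for transaction in partition:
        items = set(transaction)
        for candidate in candidates:
            if items.issuperset(candidate):
                counts[candidate] += 1
    return counts


def two_job_mining(partitions, min_ratio):
    """Illustrative two-pass scheme; a stand-in, not the paper's P2S."""
    total = sum(len(p) for p in partitions)
    # Any globally frequent itemset is locally frequent in at least one split,
    # so the union of local results is a complete candidate set.
    candidates = set().union(*(local_frequent(p, min_ratio) for p in partitions))
    global_counts = Counter()
    for p in partitions:
        global_counts.update(count_candidates(p, candidates))
    return {c: n for c, n in global_counts.items() if n >= min_ratio * total}


if __name__ == "__main__":
    splits = [
        [["a", "b", "c"], ["a", "b"], ["a", "c"]],
        [["b", "c"], ["a", "b", "c"], ["b"]],
    ]
    for itemset, support in sorted(two_job_mining(splits, 0.5).items()):
        print(itemset, support)
```

In this scheme, the split a transaction lands in determines which itemsets become local candidates, and therefore how much work the second job has to do. That is one simple way to see why data placement can matter for parallel mining, although the placement strategies studied in the paper may differ.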
