Data placement in massively distributed environments for fast parallel mining of frequent itemsets

Salah Saber; Akbarinia Reza; Masseglia Florent


Abstract

Frequent itemset mining is one of the fundamental building blocks of data mining. However, despite the important recent advances in the data mining literature, few standard or improved solutions scale. This is particularly the case when (1) the quantity of data is very large and/or (2) the minimum support is very low. In this paper, we address the problem of parallel frequent itemset mining (PFIM) in very large databases and study the impact and effectiveness of using specific data placement strategies in a massively distributed environment. By offering a clever data placement and an optimal organization of the extraction algorithms, we show that the arrangement of both the data and the different processes can make the global job either completely inoperative or very effective. In this setting, we propose two highly scalable PFIM algorithms, namely P2S (parallel-2-steps) and PATD (parallel absolute top-down). The P2S algorithm discovers itemsets from large databases in two simple yet efficient parallel jobs, while PATD makes the mining of very large databases simpler and more compact. Its mining process consists of only one parallel job, which dramatically reduces the running time, the communication cost and the energy consumption overhead in a distributed computational platform. Our proposed approaches have been extensively evaluated on massive real-world data sets. The experimental results confirm the effectiveness and scalability of our proposals through the substantial scale-up obtained with very low minimum supports compared to other alternatives.
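To make the terms in the abstract concrete, here is a minimal single-machine sketch of frequent itemset mining over a partitioned database in two passes. It illustrates only the general partition-based idea (mine each partition locally, then validate candidates globally). It is not the authors' P2S or PATD algorithm, whose details the abstract does not give. The names `local_frequent`, `mine_two_pass`, `min_support_ratio` and `max_size` are hypothetical and chosen for illustration.

```python
from collections import Counter
from itertools import combinations
from math import ceil


def local_frequent(partition, min_support_ratio, max_size=3):
    """Pass 1 (one partition): itemsets frequent inside this partition only."""
    threshold = ceil(min_support_ratio * len(partition))
    counts = Counter()
    for transaction in partition:
        items = sorted(set(transaction))
        # Brute-force enumeration of itemsets up to max_size (illustrative only).
        for k in range(1, max_size + 1):
            counts.update(combinations(items, k))
    return {itemset for itemset, c in counts.items() if c >= threshold}


def mine_two_pass(partitions, min_support_ratio, max_size=3):
    """Two passes over partitioned data; returns {itemset: global support count}."""
    # Pass 1: any globally frequent itemset is locally frequent in at least
    # one partition, so the union of local results is a safe candidate set.
    candidates = set()
    for part in partitions:
        candidates |= local_frequent(part, min_support_ratio, max_size)

    # Pass 2: count every candidate over the whole database and keep the
    # ones that reach the global minimum support.
    total = sum(len(part) for part in partitions)
    threshold = ceil(min_support_ratio * total)
    counts = Counter()
    for part in partitions:
        for transaction in part:
            items = set(transaction)
            for cand in candidates:
                if items.issuperset(cand):
                    counts[cand] += 1
    return {cand: n for cand, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    # Toy database split into two partitions (assumed data, not from the paper).
    partitions = [
        [["a", "b", "c"], ["a", "b"]],
        [["a", "c"], ["b", "c"]],
    ]
    for itemset, support in sorted(mine_two_pass(partitions, 0.5).items()):
        print(itemset, support)
```

In a distributed setting each partition would be handled by a separate worker, so how transactions are assigned to partitions determines how many locally frequent candidates must be checked in the second pass. That dependence on placement is the effect the abstract studies.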


Bibliographic details

Source
Knowledge and information systems, 2017, issue 1, 31 pages
Authors
Salah Saber; Akbarinia Reza; Masseglia Florent;
Affiliations

INRIA Montpellier France;

INRIA Montpellier France;

INRIA Montpellier France;

Original format: PDF
Language: English (eng)
Chinese Library Classification: automatic information theory;
Keywords
Frequent Itemsets; Massive Distribution; Data Placement; MapReduce;


Similar literature


1. Data placement in massively distributed environments for fast parallel mining of frequent itemsets [J]. Salah Saber, Akbarinia Reza, Masseglia Florent. Knowledge and information systems. 2017, issue 1
2. Parallel and distributed methods for incremental frequent itemset mining [J]. Otey M.E., Parthasarathy S., Chao Wang. IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics. 2004, issue 6
3. A Fast Parallel Association Rule Mining Algorithm Based on The Probability of Frequent Itemsets [J]. Marghny H. Mohamed, Hosam E. Refaat. International journal of computer science and network security. 2011, issue 5
4. Data Partitioning for Fast Mining of Frequent Itemsets in Massively Distributed Environments [C]. Saber Salah, Reza Akbarinia, Florent Masseglia. International conference on database and expert systems applications. 2015
5. Mining Frequent Itemsets from Uncertain Data: Extensions to Constrained Mining and Stream Mining [D]. Hao, Boyu. 2010
6. Genetic Programming and Frequent Itemset Mining to Identify Feature Selection Patterns of iEEG and fMRI Epilepsy Data [O]. Otis Smart, Lauren Burrell
7. Data placement in massively distributed environments for fast parallel mining of frequent itemsets [O]. Salah, Saber, Akbarinia, Reza, Masseglia, Florent. 2017
