Parallel Mining of Top-K Frequent Itemsets in Very Large Text Database

Abstract

Frequent itemset mining is a common and useful task in data mining, but most current mining algorithms cannot be applied to very large text databases. In this paper, we propose parTFI, a novel and efficient parallel algorithm that finds the top-k frequent itemsets of a specified minimum length in very large text databases. Based on a simple data structure, H-struct, parTFI uses a novel logical vertical data partitioning technique to mine top-k frequent itemsets on each mining server in parallel. Our performance study shows that, when processing very large sparse text databases, parTFI outperforms Apriori and FP-growth, two efficient frequent itemset mining algorithms, even when both run with a well-tuned min_support. Furthermore, by creating the H-struct dynamically, parTFI can handle even huge datasets that most other algorithms cannot process.
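
To make the task concrete, the following is a minimal single-machine sketch of top-k frequent itemset mining with a minimum itemset length. It is not parTFI: the abstract does not describe the H-struct or the partitioning scheme, so this sketch substitutes a plain vertical layout (each item mapped to the set of transaction ids that contain it) and a depth-first search. The function name `top_k_itemsets` and its parameters are hypothetical. What it does illustrate is why no `min_support` has to be tuned in advance: the support threshold rises on its own as the k best itemsets are found.

```python
import heapq
from collections import defaultdict


def top_k_itemsets(transactions, k, min_len):
    """Return up to k itemsets of length >= min_len with the highest support.

    Illustrative sketch only (hypothetical helper, not the parTFI algorithm).
    Result is a list of (support, itemset) tuples, highest support first.
    """
    # Vertical layout: item -> ids of the transactions that contain it.
    tidsets = defaultdict(set)
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tidsets[item].add(tid)

    # Visit frequent items first so the threshold rises quickly.
    items = sorted(tidsets, key=lambda i: (-len(tidsets[i]), i))
    heap = []  # min-heap holding the best (support, itemset) pairs so far

    def threshold():
        # Once k itemsets are held, weaker candidates can be pruned.
        return heap[0][0] if len(heap) == k else 1

    def extend(prefix, tids, start):
        for idx in range(start, len(items)):
            item = items[idx]
            new_tids = tids & tidsets[item] if prefix else tidsets[item]
            support = len(new_tids)
            # Supersets never have higher support, so pruning here is safe.
            if support < threshold():
                continue
            itemset = prefix + (item,)
            if len(itemset) >= min_len:
                if len(heap) < k:
                    heapq.heappush(heap, (support, itemset))
                elif support > heap[0][0]:
                    heapq.heapreplace(heap, (support, itemset))
            extend(itemset, new_tids, idx + 1)

    extend((), set(), 0)
    return sorted(heap, key=lambda entry: (-entry[0], entry[1]))
```

Calling `top_k_itemsets(transactions, k=3, min_len=2)` on a list of item sets returns the three best itemsets with at least two items. When several itemsets tie at the k-th support value, this sketch keeps whichever it found first.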
