Journal: Journal of Supercomputing

Parallel mining of association rules from text databases


Abstract

In this paper, we propose a new algorithm named Parallel Multipass with Inverted Hashing and Pruning (PMIHP) for mining association rules between words in text databases. The characteristics of text databases are quite different from those of retail transaction databases, and existing mining algorithms cannot handle text databases efficiently because of the large number of itemsets (i.e., sets of words) that need to be counted. The new PMIHP algorithm is a parallel version of our Multipass with Inverted Hashing and Pruning (MIHP) algorithm (Holt, Chung in: Proc of the 14th IEEE int'l conf on tools with artificial intelligence, 2002, pp 49-56), which was shown to be more efficient than other existing algorithms in the context of mining text databases. The PMIHP algorithm reduces the overhead of communication between miners running on different processors because the miners mine their local databases asynchronously and prune the global candidates by using the Inverted Hashing and Pruning technique. Compared with the well-known Count Distribution algorithm (Agrawal, Shafer in: (1996) IEEE Trans Knowl Data Eng 8(6):962-969), PMIHP demonstrates superior performance characteristics for mining association rules in large text databases, and when the minimum support level is low, its speedup is superlinear as the number of processors increases. These experiments were performed on a cluster of Linux workstations using a collection of Wall Street Journal articles.
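
For readers unfamiliar with the baseline, the following is a minimal sketch of the general Count Distribution idea that the abstract compares against: every processor counts the same candidate itemsets over its own partition of the documents, and the local counts are summed globally once per pass. It is not the PMIHP algorithm and not the authors' code. The function names, the in-process loop that stands in for separate processors and for the global sum, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def generate_candidates(frequent, k):
    """Build k-itemsets whose (k-1)-subsets are all frequent (Apriori-style)."""
    items = sorted({word for itemset in frequent for word in itemset})
    return [
        cand for cand in combinations(items, k)
        if all(sub in frequent for sub in combinations(cand, k - 1))
    ]


def local_counts(partition, candidates):
    """Count, within one partition, how many documents contain each candidate."""
    counts = Counter()
    for doc in partition:
        for cand in candidates:
            if doc.issuperset(cand):
                counts[cand] += 1
    return counts


def count_distribution(partitions, min_support, max_k=3):
    """Simulate Count Distribution in one process (hypothetical helper).

    Each element of `partitions` stands in for the local database of one
    processor; summing the per-partition counters stands in for the global
    count exchange that a real cluster would perform once per pass.
    """
    # Pass 1 candidates: every single word seen in any partition.
    candidates = sorted({(w,) for part in partitions for doc in part for w in doc})
    result = {}
    k = 1
    while candidates and k <= max_k:
        total = Counter()
        for part in partitions:
            total.update(local_counts(part, candidates))  # adds counts together
        frequent = {c: n for c, n in total.items() if n >= min_support}
        result.update(frequent)
        k += 1
        candidates = generate_candidates(set(frequent), k)
    return result


if __name__ == "__main__":
    # Toy documents (sets of words), split across two simulated processors.
    partitions = [
        [{"stock", "market", "price"}, {"stock", "market"}],
        [{"stock", "price"}, {"market", "price", "stock"}],
    ]
    for itemset, support in sorted(count_distribution(partitions, 3).items()):
        print(itemset, support)
```

In this sketch every pass requires one global exchange of counts for all candidates, which is the synchronization and communication cost the abstract says PMIHP reduces by letting miners work on local databases asynchronously.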
