Journal: IEEE Transactions on Knowledge and Data Engineering

Shared memory parallelization of data mining algorithms: techniques, programming interface, and performance


Abstract

With recent technological advances, shared memory parallel machines have become more scalable, and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this work, we focus on shared memory parallelization of data mining algorithms. We have developed a series of techniques for parallelization of data mining algorithms, including full replication, full locking, fixed locking, optimized full locking, and cache-sensitive locking. Unlike previous work on shared memory parallelization of specific data mining algorithms, all of our techniques apply to a large number of popular data mining algorithms. In addition, we propose a reduction-object-based interface for specifying a data mining algorithm. We show how our runtime system can apply any of the techniques we have developed starting from a common specification of the algorithm. We have carried out a detailed evaluation of the parallelization techniques and the programming interface. We have experimented with Apriori and FP-tree-based association mining, k-means clustering, k-nearest neighbor classifier, and decision tree construction. The main results from our experiments are as follows: 1) Among full replication, optimized full locking, and cache-sensitive locking, there is no clear winner. Each of these three techniques can outperform others depending upon machine and dataset parameters. These three techniques perform significantly better than the other two techniques. 2) Good parallel efficiency is achieved for each of the four algorithms we experimented with, using our techniques and runtime system. 3) The overhead of the interface is within 10 percent in almost all cases. 4) In the case of decision tree construction, combining different techniques turned out to be crucial for achieving high performance.
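The abstract names the techniques without defining them, so the following is only an illustrative sketch and not the authors' runtime system or interface. It assumes that "full replication" means every thread updates a private copy of the reduction object that is merged at the end, and that "full locking" means one lock guards each element of a single shared reduction object. The reduction object here is a plain list of bin counts, and the function names `count_full_replication` and `count_full_locking` are hypothetical.

```python
import threading


def _run_threads(targets):
    """Start one thread per callable and wait for all of them to finish."""
    threads = [threading.Thread(target=t) for t in targets]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def count_full_replication(chunks, num_bins):
    """Assumed 'full replication': one private copy per thread, merged afterwards."""
    copies = [[0] * num_bins for _ in chunks]

    def make_worker(idx):
        def work():
            local = copies[idx]
            for item in chunks[idx]:
                local[item % num_bins] += 1  # private copy, so no lock is taken
        return work

    _run_threads([make_worker(i) for i in range(len(chunks))])
    # Merge step: sum the per-thread copies element by element.
    return [sum(copy[b] for copy in copies) for b in range(num_bins)]


def count_full_locking(chunks, num_bins):
    """Assumed 'full locking': one shared copy, one lock per element."""
    shared = [0] * num_bins
    locks = [threading.Lock() for _ in range(num_bins)]

    def make_worker(chunk):
        def work():
            for item in chunk:
                b = item % num_bins
                with locks[b]:  # serialize updates to this element only
                    shared[b] += 1
        return work

    _run_threads([make_worker(c) for c in chunks])
    return shared


if __name__ == "__main__":
    data = [[0, 1, 2, 3], [4, 5, 6, 7], [1, 1, 1]]
    print(count_full_replication(data, 4))
    print(count_full_locking(data, 4))
```

Under these assumptions the trade-off is visible in the sketch: replication avoids locking but uses one copy of the reduction object per thread and pays a merge cost, while locking keeps a single copy but pays for a lock acquisition on every update. This is consistent with the abstract's observation that the better choice depends on machine and dataset parameters.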
