Shared memory parallelization of data mining algorithms: techniques, programming interface, and performance

Ruoming Jin; Ge Yang; Agrawal G.


Abstract

With recent technological advances, shared memory parallel machines have become more scalable, and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this work, we focus on shared memory parallelization of data mining algorithms. We have developed a series of techniques for parallelization of data mining algorithms, including full replication, full locking, fixed locking, optimized full locking, and cache-sensitive locking. Unlike previous work on shared memory parallelization of specific data mining algorithms, all of our techniques apply to a large number of popular data mining algorithms. In addition, we propose a reduction-object-based interface for specifying a data mining algorithm. We show how our runtime system can apply any of the techniques we have developed starting from a common specification of the algorithm. We have carried out a detailed evaluation of the parallelization techniques and the programming interface. We have experimented with Apriori and FP-tree-based association mining, k-means clustering, k-nearest neighbor classifier, and decision tree construction. The main results from our experiments are as follows: 1) Among full replication, optimized full locking, and cache-sensitive locking, there is no clear winner. Each of these three techniques can outperform the others depending upon machine and dataset parameters. These three techniques perform significantly better than the other two techniques. 2) Good parallel efficiency is achieved for each of the four algorithms we experimented with, using our techniques and runtime system. 3) The overhead of the interface is within 10 percent in almost all cases. 4) In the case of decision tree construction, combining different techniques turned out to be crucial for achieving high performance.
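The abstract names the techniques without defining them, so the following Python sketch is only an illustration under stated assumptions. It assumes that the "reduction object" is a table of counters updated once per data item, that full replication gives each thread a private copy merged at the end, that full locking uses one lock per element of a single shared object, and that fixed locking shares a small fixed pool of locks among all elements. The function names (`split`, `run_threads`, `full_replication`, `locking`) are hypothetical and are not the interface proposed in the paper. The cache-oriented variants (optimized full locking, cache-sensitive locking) depend on memory layout and are not modeled here.

```python
import threading
from collections import Counter


def split(data, n):
    """Split data into at most n contiguous slices of roughly equal size."""
    size = max(1, (len(data) + n - 1) // n)
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_threads(worker, n):
    """Run worker(0..n-1) on n threads and wait for all of them."""
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def full_replication(data, n_threads):
    """Assumed reading: each thread owns a private reduction object."""
    parts = split(data, n_threads)
    copies = [Counter() for _ in parts]

    def worker(i):
        for item in parts[i]:
            copies[i][item] += 1  # private copy, so no lock is needed

    run_threads(worker, len(parts))
    merged = Counter()
    for c in copies:  # sequential merge step after all threads finish
        merged.update(c)
    return merged


def locking(data, n_threads, keys, n_locks=None):
    """Assumed reading: one shared reduction object guarded by locks.

    n_locks=None gives one lock per element (full locking);
    a smaller n_locks gives a fixed pool of locks (fixed locking).
    """
    keys = list(keys)
    slot = {k: i for i, k in enumerate(keys)}
    counts = [0] * len(keys)
    n_locks = n_locks or len(keys)
    locks = [threading.Lock() for _ in range(n_locks)]
    parts = split(data, n_threads)

    def worker(i):
        for item in parts[i]:
            j = slot[item]
            with locks[j % n_locks]:  # element j maps to lock j mod n_locks
                counts[j] += 1

    run_threads(worker, len(parts))
    return dict(zip(keys, counts))
```

Calling `full_replication(data, 4)` and `locking(data, 4, sorted(set(data)))` on the same list should produce the same counts. The trade-off the sketch exposes is memory and merge cost for replication versus lock overhead and contention for the locking variants, which is consistent with the abstract's remark that no single technique wins in all settings.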


Bibliographic record

Source
IEEE Transactions on Knowledge and Data Engineering, 2005, no. 1, pp. 71-89 (19 pages)
Authors
Ruoming Jin; Ge Yang; Agrawal G.
Original format: PDF
Language: English
Chinese Library Classification: computing technology, computer technology
Keywords
data mining; data warehouses; decision trees; formal specification; middleware; parallel algorithms; parallel machines; pattern clustering; shared memory systems; cache-sensitive locking; data mining algorithm; data warehousing; dataset parameter; decision tree const;

