Improving Execution Concurrency of Large-Scale Matrix Multiplication on Distributed Data-Parallel Platforms

Rong Gu; Yun Tang; Chen Tian; Hucheng Zhou; Guanru Li; Xudong Zheng; Yihua Huang

Abstract

Matrix multiplication is a dominant but very time-consuming operation in many big data analytic applications, so optimizing its performance is an important and fundamental research issue. The performance of large-scale matrix multiplication on distributed data-parallel platforms is determined by both computation and IO costs. For existing matrix multiplication execution strategies, when the execution concurrency scales up above a threshold, execution performance deteriorates quickly because the increase in IO cost outweighs the decrease in computation cost. This paper presents a novel parallel execution strategy, CRMM (Concurrent Replication-based Matrix Multiplication), along with a parallel algorithm, Marlin, for large-scale matrix multiplication on data-parallel platforms. The CRMM strategy exploits higher execution concurrency for sub-block matrix multiplication with the same IO cost. To further improve the performance of Marlin, we also propose a number of novel system-level optimizations, including increasing the concurrency of local data exchange by calling the native library in batch, reducing the overhead of block matrix transformation, and reducing disk-heavy shuffle operations by exploiting the semantics of matrix computation. We have implemented Marlin as a library on Spark, along with a set of related matrix operations, and have contributed Marlin to the open-source community. For large-sized matrix multiplication, Marlin outperforms existing systems including Spark MLlib, SystemML and SciDB, with about …, … and … speedup on average, respectively. An evaluation on a real-world DNN workload also indicates that Marlin outperforms the above systems by about …, … and … speedup, respectively.
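The trade-off between computation and IO cost described in the abstract can be illustrated with a small sketch. This is a generic block-partitioned multiplication in plain Python, not the paper's CRMM strategy or the Marlin implementation. All function names are hypothetical, and the two counters are only rough proxies: `tasks` stands for the number of independent sub-block multiplications (the available concurrency), and `shipped` for the number of matrix elements copied to those tasks (the IO volume). The sketch assumes the matrix dimensions divide evenly by the grid sizes.

```python
from itertools import product


def split_blocks(mat, row_parts, col_parts):
    """Split a list-of-lists matrix into a row_parts x col_parts grid of blocks."""
    rstep = len(mat) // row_parts
    cstep = len(mat[0]) // col_parts
    return {
        (bi, bj): [row[bj * cstep:(bj + 1) * cstep]
                   for row in mat[bi * rstep:(bi + 1) * rstep]]
        for bi, bj in product(range(row_parts), range(col_parts))
    }


def local_multiply(a, b):
    """Dense product of two small blocks (stand-in for a native library call)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add_blocks(x, y):
    """Element-wise sum of two equally sized blocks."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def block_multiply(a, b, m, k, n):
    """Multiply a by b with an m x k block grid for a and a k x n grid for b.

    Returns (result_blocks, tasks, shipped), where tasks counts independent
    sub-block products and shipped counts elements copied to those tasks.
    """
    a_blocks = split_blocks(a, m, k)
    b_blocks = split_blocks(b, k, n)
    tasks, shipped, result = 0, 0, {}
    for i, j, p in product(range(m), range(n), range(k)):
        left, right = a_blocks[(i, p)], b_blocks[(p, j)]
        partial = local_multiply(left, right)
        tasks += 1
        shipped += len(left) * len(left[0]) + len(right) * len(right[0])
        # Partial products that share (i, j) are summed over the inner index p.
        result[(i, j)] = (add_blocks(result[(i, j)], partial)
                          if (i, j) in result else partial)
    return result, tasks, shipped


def assemble(blocks, m, n):
    """Stitch an m x n grid of result blocks back into one matrix."""
    out = []
    for i in range(m):
        block_row = [blocks[(i, j)] for j in range(n)]
        for r in range(len(block_row[0])):
            out.append([v for blk in block_row for v in blk[r]])
    return out
```

For two 4 x 4 matrices, `block_multiply(a, b, 1, 1, 1)` yields 1 task and 32 shipped elements, while `block_multiply(a, b, 2, 2, 2)` yields 8 tasks and 64 shipped elements. In this naive scheme, finer partitioning buys concurrency at the price of more data movement, which is the tension the abstract says CRMM is designed to ease.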

Bibliographic details

Source
IEEE Transactions on Parallel and Distributed Systems | 2017, Issue 9 | pp. 2539-2552 | 14 pages
Authors
Rong Gu; Yun Tang; Chen Tian; Hucheng Zhou; Guanru Li; Xudong Zheng; Yihua Huang
Author affiliations

State Key Laboratory for Novel Software Technology, Nanjing University, Jiangsu Sheng, China;

State Key Laboratory for Novel Software Technology, Nanjing University, Jiangsu Sheng, China;

State Key Laboratory for Novel Software Technology, Nanjing University, Jiangsu Sheng, China;

Microsoft Research, Beijing, China;

Microsoft Research, Beijing, China;

Microsoft Research, Beijing, China;

State Key Laboratory for Novel Software Technology, Nanjing University, Jiangsu Sheng, China;

Original format: PDF
Language: English
Keywords
Concurrent computing; Libraries; Sparks; Machine learning algorithms; Training; Optimization; Big Data;

