Distributed and Parallel Databases

Scalable machine learning computing a data summarization matrix with a parallel array DBMS


Abstract

Big data analytics requires scalable (beyond RAM limits) and highly parallel (exploiting many CPU cores) processing of machine learning models, which in general involve heavy matrix manipulation. Array DBMSs are a promising class of systems for manipulating large matrices. With that motivation in mind, we present a high-performance system that exploits a parallel array DBMS to evaluate a general, yet compact, matrix summarization that benefits many machine learning models. We focus on two representative models: linear regression (supervised) and PCA (unsupervised). Our approach combines data summarization inside the parallel DBMS with further model computation in a mathematical language (e.g. R). We introduce a two-phase algorithm that first computes a general data summary in parallel and then evaluates matrix equations on reduced intermediate matrices in main memory on a single node. We present theoretical results characterizing speedup and time/space complexity. From a parallel data system perspective, we consider scale-up and scale-out in a shared-nothing architecture. In contrast to most big data analytic systems, our system is based on array operators programmed in C++ that work directly on the Unix file system, instead of Java or Scala running on HDFS mounted on top of Unix, resulting in much faster processing. Experiments compare our system with Spark (parallel) and R (single machine), showing orders-of-magnitude time improvements. We present parallel benchmarks varying the number of threads and processing nodes. Our two-phase approach should motivate analysts to exploit a parallel array DBMS for matrix summarization.
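The abstract does not spell out the summarization matrix itself, but the two-phase idea can be illustrated with a small single-node sketch in R. It assumes the summary is the matrix Gamma = Z'Z of the augmented data Z = [1, X, y], a common formulation for this kind of one-pass summarization; in the paper phase one runs in parallel inside the array DBMS, while here both phases are plain R for clarity. Linear regression and PCA in phase two then need only the small summary, not the raw data.

    # Sketch of the two-phase approach on one node (assumption: the data
    # summary is Gamma = Z'Z with Z = [1, X, y]; the parallel DBMS phase
    # is replaced by a single in-memory pass for illustration).
    set.seed(1)
    n <- 1000; d <- 3
    X <- matrix(rnorm(n * d), nrow = n)            # n points, d features
    y <- X %*% c(1, -2, 0.5) + rnorm(n, sd = 0.1)  # synthetic response

    # Phase 1: one pass over the data, producing a (d+2) x (d+2) summary.
    Z     <- cbind(1, X, y)                        # augmented points as rows
    Gamma <- crossprod(Z)                          # Z'Z: holds n, column sums, X'X, X'y, y'y

    # Phase 2: models computed from the summary alone, in main memory.
    nG <- Gamma[1, 1]                              # point count n
    L  <- Gamma[1, 2:(d + 1)]                      # column sums of X
    Q  <- Gamma[2:(d + 1), 2:(d + 1)]              # X'X

    # Linear regression (with intercept): normal equations read off Gamma.
    beta <- solve(Gamma[1:(d + 1), 1:(d + 1)], Gamma[1:(d + 1), d + 2])

    # PCA: covariance matrix from n, L, Q, then its eigendecomposition.
    V   <- Q / nG - tcrossprod(L / nG)
    pca <- eigen(V)

Because Gamma is only (d+2) x (d+2), phase two runs in milliseconds regardless of n, which is the point of pushing the summarization into the parallel DBMS.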
