Scalable machine learning computing a data summarization matrix with a parallel array DBMS

Ordonez Carlos; Zhang Yiqun; Johnsson S. Lennart


Abstract

Big data analytics requires scalable (beyond RAM limits) and highly parallel (exploiting many CPU cores) processing of machine learning models, which in general involve heavy matrix manipulation. Array DBMSs are a promising platform for manipulating large matrices. With that motivation in mind, we present a high-performance system exploiting a parallel array DBMS to evaluate a general, but compact, matrix summarization that benefits many machine learning models. We focus on two representative models: linear regression (supervised) and PCA (unsupervised). Our approach combines data summarization inside the parallel DBMS with further model computation in a mathematical language (e.g. R). We introduce a two-phase algorithm which first computes a general data summary in parallel and then evaluates matrix equations with reduced intermediate matrices in main memory on one node. We present theoretical results characterizing speedup and time/space complexity. From a parallel data system perspective, we consider scale-up and scale-out in a shared-nothing architecture. In contrast to most big data analytic systems, our system is based on array operators programmed in C++, working directly on the Unix file system instead of Java or Scala running on HDFS mounted on top of Unix, resulting in much faster processing. Experiments compare our system with Spark (parallel) and R (single machine), showing orders-of-magnitude improvements in running time. We present parallel benchmarks varying the number of threads and processing nodes. Our two-phase approach should motivate analysts to exploit a parallel array DBMS for matrix summarization.
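
The two-phase idea in the abstract can be illustrated with a small sketch. The abstract does not define the summary itself, so the code below assumes one common choice: each row is augmented as z = [1, x_1, ..., x_d, y], and every partition accumulates the sum of the outer products z z^T. The function names (`summarize`, `merge`, `linear_regression`) and the in-memory list-based partitions are hypothetical, chosen only for illustration; this is plain Python, not the authors' C++ array operators or any DBMS API.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def summarize(rows: Sequence[Sequence[float]]) -> Matrix:
    """Phase 1, run once per partition: accumulate the sum of z z^T,
    where z = [1, x_1, ..., x_d, y] (assumed form of the summary)."""
    k = len(rows[0]) + 1  # assumes a non-empty partition
    gamma = [[0.0] * k for _ in range(k)]
    for row in rows:
        z = [1.0, *row]
        for i in range(k):
            for j in range(k):
                gamma[i][j] += z[i] * z[j]
    return gamma


def merge(a: Matrix, b: Matrix) -> Matrix:
    """Combine two partial summaries; addition is all that is needed."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def solve(a: Matrix, b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def linear_regression(gamma: Matrix) -> List[float]:
    """Phase 2, on one node: solve the normal equations using only the
    summary. Returns [intercept, b_1, ..., b_d]."""
    p = len(gamma) - 1                      # columns of [1, X]
    xtx = [row[:p] for row in gamma[:p]]    # top-left block: [1,X]^T [1,X]
    xty = [row[p] for row in gamma[:p]]     # last column:    [1,X]^T y
    return solve(xtx, xty)


if __name__ == "__main__":
    # Rows are (x1, x2, y), generated from y = 2 + 3*x1 - x2.
    part_a = [(0, 0, 2), (1, 0, 5), (0, 1, 1)]
    part_b = [(1, 1, 4), (2, 1, 7), (1, 3, 2)]
    gamma = merge(summarize(part_a), summarize(part_b))
    print(linear_regression(gamma))
```

The summary is a (d+2) x (d+2) matrix regardless of the number of rows, which is why the second phase can run in main memory on a single node. Under the same assumption, the covariance matrix needed for PCA can also be derived from the top-left block of the summary without rereading the data.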

Bibliographic record

Source: Distributed and Parallel Databases, 2019, Issue 3, pp. 329-350 (22 pages)
Authors: Ordonez Carlos; Zhang Yiqun; Johnsson S. Lennart
Affiliation: Univ Houston, Dept Comp Sci, Houston, TX 77204, USA (listed for each of the three authors)
Original format: PDF
Language: English
Keywords: Matrix; Summarization; Parallel DBMS; Linear algebra
