Distributed and Parallel Databases

Scalable machine learning computing a data summarization matrix with a parallel array DBMS


Abstract

Big data analytics requires scalable (beyond RAM limits) and highly parallel (exploiting many CPU cores) processing of machine learning models, which in general involve heavy matrix manipulation. Array DBMSs are a promising class of systems for manipulating large matrices. With that motivation in mind, we present a high-performance system that exploits a parallel array DBMS to evaluate a general, yet compact, matrix summarization that benefits many machine learning models. We focus on two representative models: linear regression (supervised) and PCA (unsupervised). Our approach combines data summarization inside the parallel DBMS with further model computation in a mathematical language (e.g. R). We introduce a two-phase algorithm that first computes a general data summary in parallel and then evaluates matrix equations on reduced intermediate matrices in main memory on a single node. We present theoretical results characterizing speedup and time/space complexity. From a parallel data system perspective, we consider scale-up and scale-out in a shared-nothing architecture. In contrast to most big data analytic systems, our system is based on array operators programmed in C++ that work directly on the Unix file system, instead of Java or Scala running on HDFS mounted on top of Unix, resulting in much faster processing. Experiments compare our system with Spark (parallel) and R (single machine), showing orders-of-magnitude time improvements. We present parallel benchmarks varying the number of threads and processing nodes. Our two-phase approach should motivate analysts to exploit a parallel array DBMS for matrix summarization.
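The abstract does not spell out the summarization matrix itself, but the two-phase idea can be illustrated with a small single-node sketch in R. It assumes the summary is the matrix Gamma = Z'Z of the augmented data Z = [1, X, y], a common formulation for this kind of one-pass summarization; in the paper phase one runs in parallel inside the array DBMS, while here both phases are plain R for clarity. Linear regression and PCA in phase two then need only the small summary, not the raw data.

    # Sketch of the two-phase approach on one node (assumption: the data
    # summary is Gamma = Z'Z with Z = [1, X, y]; the parallel DBMS phase
    # is replaced by a single in-memory pass for illustration).
    set.seed(1)
    n <- 1000; d <- 3
    X <- matrix(rnorm(n * d), nrow = n)            # n points, d features
    y <- X %*% c(1, -2, 0.5) + rnorm(n, sd = 0.1)  # synthetic response

    # Phase 1: one pass over the data, producing a (d+2) x (d+2) summary.
    Z     <- cbind(1, X, y)                        # augmented points as rows
    Gamma <- crossprod(Z)                          # Z'Z: holds n, column sums, X'X, X'y, y'y

    # Phase 2: models computed from the summary alone, in main memory.
    nG <- Gamma[1, 1]                              # point count n
    L  <- Gamma[1, 2:(d + 1)]                      # column sums of X
    Q  <- Gamma[2:(d + 1), 2:(d + 1)]              # X'X

    # Linear regression (with intercept): normal equations read off Gamma.
    beta <- solve(Gamma[1:(d + 1), 1:(d + 1)], Gamma[1:(d + 1), d + 2])

    # PCA: covariance matrix from n, L, Q, then its eigendecomposition.
    V   <- Q / nG - tcrossprod(L / nG)
    pca <- eigen(V)

Because Gamma is only (d+2) x (d+2), phase two runs in milliseconds regardless of n, which is the point of pushing the summarization into the parallel DBMS.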
