Journal: Distributed and Parallel Databases

Scalable machine learning computing a data summarization matrix with a parallel array DBMS

Abstract

Big data analytics requires scalable (beyond RAM limits) and highly parallel (exploiting many CPU cores) processing of machine learning models, which in general involve heavy matrix manipulation. Array DBMSs are a promising class of systems for manipulating large matrices. With that motivation in mind, we present a high-performance system that exploits a parallel array DBMS to evaluate a general, but compact, matrix summarization that benefits many machine learning models. We focus on two representative models: linear regression (supervised) and PCA (unsupervised). Our approach combines data summarization inside the parallel DBMS with further model computation in a mathematical language (e.g. R). We introduce a two-phase algorithm which first computes a general data summary in parallel and then evaluates matrix equations with reduced intermediate matrices in main memory on one node. We present theoretical results characterizing speedup and time/space complexity. From a parallel data system perspective, we consider scale-up and scale-out in a shared-nothing architecture. In contrast to most big data analytic systems, our system is based on array operators programmed in C++ that work directly on the Unix file system, instead of Java or Scala running on HDFS mounted on top of Unix, resulting in much faster processing. Experiments compare our system with Spark (parallel) and R (single machine), showing orders-of-magnitude improvements in running time. We present parallel benchmarks varying the number of threads and processing nodes. Our two-phase approach should motivate analysts to exploit a parallel array DBMS for matrix summarization.
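The two-phase idea can be illustrated with a small sketch. The abstract does not spell out the form of the summarization matrix, so the code below assumes one plausible choice: the sum of outer products z zᵀ over all rows, where z = [1, x_1, ..., x_d, y]. Under that assumption the summary holds the row count, the column sums, and all cross-products. The function names (`partial_summary`, `summarize`, `linear_regression`) are hypothetical, a Python thread pool stands in for the DBMS workers, and plain Python replaces both the C++ array operators and R. It is a sketch of the pattern, not the paper's implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_summary(rows):
    """Phase 1, per partition: accumulate sum of z z^T with z = [1, x_1..x_d, y].

    ASSUMPTION: this form of the summary is illustrative, not taken from the abstract.
    """
    k = len(rows[0]) + 1
    gamma = [[0.0] * k for _ in range(k)]
    for row in rows:
        z = [1.0, *row]
        for i in range(k):
            for j in range(k):
                gamma[i][j] += z[i] * z[j]
    return gamma


def summarize(partitions):
    """Phase 1, global: compute partial summaries concurrently, then add them up."""
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(partial_summary, partitions))
    total = parts[0]
    for part in parts[1:]:
        total = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(total, part)]
    return total


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def linear_regression(gamma):
    """Phase 2, single node: least-squares coefficients from the summary alone."""
    d1 = len(gamma) - 1                     # intercept plus d features
    xtx = [row[:d1] for row in gamma[:d1]]  # block holding X^T X (with intercept)
    xty = [row[d1] for row in gamma[:d1]]   # block holding X^T y
    return solve(xtx, xty)


if __name__ == "__main__":
    # Two partitions of rows (x1, x2, y), generated from y = 1 + 2*x1 + 3*x2.
    partitions = [
        [(0.0, 0.0, 1.0), (1.0, 0.0, 3.0)],
        [(0.0, 1.0, 4.0), (1.0, 1.0, 6.0), (2.0, 1.0, 8.0)],
    ]
    gamma = summarize(partitions)
    print(linear_regression(gamma))
```

Phase 2 only touches a (d+2) × (d+2) matrix, so its cost does not depend on the number of rows, which is why it can run in main memory on one node. Under the same assumption, a covariance matrix for PCA could be derived from the same summary (cross-products divided by n, minus the outer product of the column means) without a second pass over the data.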