IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing

Big Data Analytics Integrating a Parallel Columnar DBMS and the R Language

Abstract

Most research has proposed scalable and parallel analytic algorithms that work outside a DBMS. Meanwhile, R has become a very popular system for machine learning analysis, but it is limited by main memory and single-threaded processing. Recently, novel columnar DBMSs have been shown to provide orders-of-magnitude improvements in SQL query processing speed while preserving the parallel speedup of row-based parallel DBMSs. With that motivation in mind, we present COLUMNAR, a system integrating a parallel columnar DBMS and R that can directly compute models on large data sets stored as relational tables. Our algorithms are based on a combination of SQL queries, user-defined functions (UDFs) and R calls, where SQL queries and UDFs compute data set summaries that are sent to R to compute models in RAM. Since our hybrid algorithms exploit the DBMS for the most demanding computations involving the data set, they show linear scalability and are highly parallel. Our algorithms generally require one pass over the data set, or otherwise a few passes (i.e., fewer passes than traditional methods). Our system can analyze data sets faster than R even when they fit in RAM, and it also eliminates the memory limitations of R when data sets exceed RAM size. Moreover, it is an order of magnitude faster than Spark (a prominent Hadoop system) and a traditional row-based DBMS.
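The division of labor described in the abstract (the DBMS reduces the table to small summaries, and the model is then computed in RAM from those summaries alone) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract does not say which summaries COLUMNAR computes, so the sketch assumes the count, sums and cross-products needed for a one-variable least-squares fit. SQLite stands in for the parallel columnar DBMS, Python stands in for R, and the table name `points` and the functions `summarize` and `fit_line` are hypothetical.

```python
import sqlite3


def summarize(conn):
    """One pass over the table; the DBMS returns only five aggregates.

    Assumption: a hypothetical table points(x, y). Only this tiny summary
    row leaves the database, regardless of how many rows the table holds.
    """
    return conn.execute(
        "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y) FROM points"
    ).fetchone()


def fit_line(n, sx, sy, sxx, sxy):
    """Least-squares slope and intercept computed from the summaries alone.

    This step plays the role the abstract assigns to R: it works in RAM
    on a constant-size input and never touches the original rows.
    """
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("x has zero variance; the slope is undefined")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Build a small synthetic table that follows y = 2x + 1 exactly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?)",
    [(float(i), 2.0 * i + 1.0) for i in range(1000)],
)

summary = summarize(conn)            # single scan inside the database
slope, intercept = fit_line(*summary)  # model computed from five numbers
print(slope, intercept)
```

Because the summary has a fixed size, the in-memory step costs the same whether the table has a thousand rows or billions, which is consistent with the abstract's claim that memory limits no longer depend on the data set size.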