Big Data Analytics Integrating a Parallel Columnar DBMS and the R Language

Abstract

Most research has proposed scalable and parallel analytic algorithms that work outside a DBMS. R, meanwhile, has become a very popular system for machine learning analysis, but it is limited by main memory and single-threaded processing. Recently, novel columnar DBMSs have been shown to provide orders-of-magnitude improvements in SQL query processing speed while preserving the parallel speedup of row-based parallel DBMSs. With that motivation, we present COLUMNAR, a system integrating a parallel columnar DBMS and R that can directly compute models on large data sets stored as relational tables. Our algorithms combine SQL queries, user-defined functions (UDFs), and R calls: the SQL queries and UDFs compute data set summaries, which are sent to R to compute models in RAM. Because our hybrid algorithms exploit the DBMS for the most demanding computations involving the data set, they show linear scalability and are highly parallel. Our algorithms generally require one pass over the data set, or otherwise a few passes (i.e., fewer passes than traditional methods). Our system can analyze data sets faster than R even when they fit in RAM, and it eliminates R's memory limitations when data sets exceed RAM size. It is also an order of magnitude faster than Spark (a prominent Hadoop system) and a traditional row-based DBMS.
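The division of labor described in the abstract (the DBMS summarizes the table, the analytic side fits the model from the summary) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: SQLite stands in for the parallel columnar DBMS, Python stands in for R, and the table `points`, its columns `x` and `y`, the function name `fit_line_from_summaries`, and the choice of simple linear regression are all hypothetical.

```python
import sqlite3


def fit_line_from_summaries(conn):
    """Fit y = slope * x + intercept using only aggregate summaries.

    Assumption: a table named `points` with numeric columns x and y.
    The single aggregate query is one pass over the data; only five
    numbers leave the database, regardless of how many rows it holds.
    """
    n, sx, sy, sxx, sxy = conn.execute(
        "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y) FROM points"
    ).fetchone()

    # Closed-form least squares computed in memory from the summaries.
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("x has zero variance; the slope is undefined")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical demo data: points lying exactly on y = 2x + 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?)",
    [(float(i), 2.0 * i + 1.0) for i in range(1000)],
)

slope, intercept = fit_line_from_summaries(conn)
print(slope, intercept)
```

Because these summaries are plain sums, they could in principle be computed independently on each partition of a table and then added together, which is one way a summary-based design can benefit from a parallel DBMS.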

Bibliographic details

Source
IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2016, pp. 627-630 (4 pages)
Authors
Yiqun Zhang; Carlos Ordonez; Wellington Cabrera
Original format
PDF
Keywords
Mathematical model; Computational modeling; Random access memory; Data models; Load modeling; Layout; Numerical models
