
Statistical Model Computation with UDFs


Abstract

Statistical models are generally computed outside a DBMS due to their mathematical complexity. We introduce techniques to efficiently compute fundamental statistical models inside a DBMS exploiting User-Defined Functions (UDFs). Specifically, we study the computation of linear regression, PCA, clustering, and Naive Bayes. Two summary matrices on the data set are mathematically shown to be essential for all models: the linear sum of points and the quadratic sum of cross products of points. We consider two layouts for the input data set: horizontal and vertical. We first introduce efficient SQL queries to compute summary matrices and score the data set. Based on the SQL framework, we introduce UDFs that work in a single table scan: aggregate UDFs to compute summary matrices for all models and a set of primitive scalar UDFs to score data sets. Experiments compare UDFs and SQL queries (running inside the DBMS) with C++ (analyzing exported files). In general, UDFs are faster than SQL queries and not much slower than C++. Considering export times, C++ is slower than UDFs and SQL queries. Statistical models based on precomputed summary matrices are computed in a few seconds. UDFs scale linearly and only require one table scan, highlighting their efficiency.
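As an illustration of the two summary matrices mentioned above, the following is a minimal Python sketch, not the authors' UDF or SQL code. It assumes a horizontal layout (one point per row) and uses the hypothetical names `summarize` and `mean_and_covariance`. It accumulates the count n, the linear sum L, and the quadratic sum of cross products Q in one pass, then derives the mean and the population covariance from them through the standard identity cov = Q/n - (L/n)(L/n)^T, without rereading the data.

```python
def summarize(rows):
    """Single pass over the rows: return n, L (linear sum), Q (sum of cross products)."""
    n = 0
    L = None
    Q = None
    for x in rows:
        if L is None:
            # Allocate the accumulators once the dimensionality d is known.
            d = len(x)
            L = [0.0] * d
            Q = [[0.0] * d for _ in range(d)]
        n += 1
        for a in range(d):
            L[a] += x[a]
            for b in range(d):
                Q[a][b] += x[a] * x[b]
    return n, L, Q


def mean_and_covariance(n, L, Q):
    """Derive the mean vector and population covariance from n, L, Q alone."""
    d = len(L)
    mean = [L[a] / n for a in range(d)]
    cov = [[Q[a][b] / n - mean[a] * mean[b] for b in range(d)]
           for a in range(d)]
    return mean, cov


# Illustrative data only; these points are not taken from the paper.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
n, L, Q = summarize(points)
mean, cov = mean_and_covariance(n, L, Q)
print(n, L, Q)
print(mean, cov)
```

Because n, L, and Q are plain sums, partial results from separate partitions of a table can be added together, which is one reason an aggregate function is a natural fit for this kind of computation.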
