Statistical Model Computation with UDFs

Ordonez Carlos



Abstract

Statistical models are generally computed outside a DBMS due to their mathematical complexity. We introduce techniques to efficiently compute fundamental statistical models inside a DBMS exploiting User-Defined Functions (UDFs). Specifically, we study the computation of linear regression, PCA, clustering, and Naive Bayes. Two summary matrices on the data set are mathematically shown to be essential for all models: the linear sum of points and the quadratic sum of cross products of points. We consider two layouts for the input data set: horizontal and vertical. We first introduce efficient SQL queries to compute summary matrices and score the data set. Based on the SQL framework, we introduce UDFs that work in a single table scan: aggregate UDFs to compute summary matrices for all models and a set of primitive scalar UDFs to score data sets. Experiments compare UDFs and SQL queries (running inside the DBMS) with C++ (analyzing exported files). In general, UDFs are faster than SQL queries and not much slower than C++. Considering export times, C++ is slower than UDFs and SQL queries. Statistical models based on precomputed summary matrices are computed in a few seconds. UDFs scale linearly and only require one table scan, highlighting their efficiency.
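The role of the two summary matrices can be illustrated with a minimal sketch. It is not the paper's UDF or SQL code; it is a plain Python stand-in for an aggregate that scans the data once. The names `n`, `L`, `Q` and all function names are assumptions chosen for this illustration: `L` is the linear sum of points and `Q` is the quadratic sum of cross products. Once these are accumulated, the mean, covariance, and correlation matrices (a common starting point for PCA) follow without reading the data again.

```python
def summary_matrices(points):
    """One pass over the points: return n, L (linear sum), Q (sum of cross products)."""
    n, L, Q = 0, None, None
    for x in points:
        if L is None:
            d = len(x)
            L = [0.0] * d
            Q = [[0.0] * d for _ in range(d)]
        n += 1
        for a in range(d):
            L[a] += x[a]
            for b in range(d):
                Q[a][b] += x[a] * x[b]
    if n == 0:
        raise ValueError("empty data set")
    return n, L, Q


def mean_and_covariance(n, L, Q):
    """Mean vector and population covariance matrix, derived from n, L, Q only."""
    d = len(L)
    mu = [L[a] / n for a in range(d)]
    cov = [[Q[a][b] / n - mu[a] * mu[b] for b in range(d)] for a in range(d)]
    return mu, cov


def correlation(n, L, Q):
    """Correlation matrix from n, L, Q only (assumes every dimension has nonzero variance)."""
    _, cov = mean_and_covariance(n, L, Q)
    d = len(L)
    return [[cov[a][b] / (cov[a][a] * cov[b][b]) ** 0.5 for b in range(d)]
            for a in range(d)]


if __name__ == "__main__":
    # Hypothetical toy data set with d = 2 dimensions.
    data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
    n, L, Q = summary_matrices(data)
    mu, cov = mean_and_covariance(n, L, Q)
    print(n, L, Q)
    print(mu, cov)
    print(correlation(n, L, Q))
```

Because `n`, `L`, and `Q` are plain sums, partial results from separate chunks of a table can be added together, which is what makes a single-scan aggregate possible. One caveat of this sketch: the formula `Q/n - mu*mu` can lose precision when the mean is large relative to the spread of the data.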


Bibliographic record

Source
Knowledge and Data Engineering, IEEE Transactions on | 2010, No. 12 | pp. 1752-1765 | 14 pages
Author
Ordonez Carlos
Affiliation
University of Houston, Houston
Original format
PDF
Language
English (eng)
Keywords
DBMS; SQL; UDF; statistical model

