Home > Foreign-language conferences > SIGMOD/PODS > Building Statistical Models and Scoring with UDFs

Building Statistical Models and Scoring with UDFs


Abstract

Multidimensional statistical models are generally computed outside a relational DBMS, exporting data sets. This article explains how fundamental multidimensional statistical models are computed inside the DBMS in a single table scan exploiting SQL and User-Defined Functions (UDFs). The techniques described herein are used in a commercial data mining tool, called Teradata Warehouse Miner. Specifically, we explain how correlation, linear regression, PCA and clustering are integrated into the Teradata DBMS. Two major database processing tasks are discussed: building a model and scoring a data set based on a model. To build a model, two summary matrices are shown to be common and essential for all linear models: the linear sum of points and the quadratic sum of cross-products of points. Since such matrices are generally significantly smaller than the data set, we explain how the remaining matrix operations to build the model can be quickly performed outside the DBMS. We first explain how to efficiently compute summary matrices with plain SQL queries. Then we present two sets of UDFs that work in a single table scan: an aggregate UDF to compute summary matrices and a set of scalar UDFs to score data sets. Experiments compare UDFs and SQL queries (running inside the DBMS) with C++ (running outside on exported files). In general, UDFs are faster than SQL queries and UDFs are more efficient than C++, due to long export times. Statistical models based on the summary matrices can be built outside the DBMS in just a few seconds. Aggregate and scalar UDFs scale linearly and require only one table scan, making them ideal to process large data sets.
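The role of the two summary matrices can be illustrated with a small sketch. It is not the paper's UDF code and does not run inside a DBMS: it is plain Python that mimics what a single-scan aggregate might accumulate, and then derives a correlation matrix from the summaries alone. The names `n`, `L`, `Q`, `summarize`, and `correlation` are assumptions chosen for illustration. The formula used is the standard identity rho_ab = (n*Q_ab - L_a*L_b) / (sqrt(n*Q_aa - L_a^2) * sqrt(n*Q_bb - L_b^2)).

```python
import math


def summarize(rows):
    """Single pass over the data set (illustrative stand-in for one table scan).

    Returns n (row count), L (linear sum of points, length d) and
    Q (sum of cross-products of points, d x d). Names are hypothetical.
    """
    n = 0
    L = []
    Q = []
    for x in rows:
        if n == 0:
            d = len(x)
            L = [0.0] * d
            Q = [[0.0] * d for _ in range(d)]
        n += 1
        for a in range(len(L)):
            L[a] += x[a]
            for b in range(len(L)):
                Q[a][b] += x[a] * x[b]
    return n, L, Q


def correlation(n, L, Q):
    """Correlation matrix computed from the summaries only, without the rows."""
    d = len(L)
    # n*Q[a][a] - L[a]^2 equals n^2 times the population variance of dimension a.
    scale = [math.sqrt(n * Q[a][a] - L[a] ** 2) for a in range(d)]
    return [
        [(n * Q[a][b] - L[a] * L[b]) / (scale[a] * scale[b]) for b in range(d)]
        for a in range(d)
    ]


if __name__ == "__main__":
    points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
    n, L, Q = summarize(points)
    print(n, L, Q)
    print(correlation(n, L, Q))
```

Because `L` has d entries and `Q` has d x d entries regardless of the number of rows, the second step touches only a tiny amount of data, which is consistent with the abstract's point that the remaining matrix operations can be done quickly outside the DBMS. A dimension with zero variance would make `scale` zero; this sketch does not guard against that case.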
