Journal: IEEE Transactions on Knowledge and Data Engineering

Declarative Parameterizations of User-Defined Functions for Large-Scale Machine Learning and Optimization


Abstract

Large-scale optimization has become an important application for data management systems, particularly in the context of statistical machine learning. In this paper, we consider how one might implement the join-and-co-group pattern in a fully declarative data processing system. The join-and-co-group pattern is ubiquitous in iterative, large-scale optimization. In this pattern, a user-defined function g is parameterized with a data object x as well as the subset of the statistical model Θ(x) that applies to that object, so that g(x | Θ(x)) can be used to compute a partial update of the model. This is repeated for every x in the full data set X. All partial updates are then aggregated and used to perform a complete update of the model. The pattern poses several implementation challenges, including the potential for a massive blow-up in the size of a fully parameterized model. Thus, unless the correct physical execution plan is chosen, an execution can easily take a very long time or even fail to complete. We carefully consider the alternatives for implementing the join-and-co-group pattern on top of a declarative system, as well as how the best alternative can be selected automatically. Our focus is on the SimSQL database system, which is an SQL-based system with special facilities for large-scale, iterative optimization. Since it is an SQL-based system with a query optimizer, those choices can be made automatically.
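The pattern described in the abstract can be illustrated with a small in-memory sketch. This is not SimSQL code and not the authors' implementation; it is a single-machine Python approximation, and the names `join_and_co_group`, `keys_for`, `g`, and `combine`, as well as the toy data, are hypothetical and chosen only for illustration. It shows the three steps named in the abstract: pairing each x with Θ(x), calling g(x | Θ(x)) to get partial updates, and aggregating those updates into a full model update.

```python
from collections import defaultdict


def join_and_co_group(data, model, keys_for, g, combine):
    """Run one iteration of the pattern and return an updated copy of the model.

    All parameter names are hypothetical:
      keys_for(x)   -> ids of the model entries that apply to x, i.e. Theta(x)
      g(x, theta_x) -> dict of partial updates keyed by model-entry id
      combine(old, deltas) -> new value for one model entry
    """
    partials = defaultdict(list)
    for x in data:
        # "Join": fetch Theta(x), the subset of the model that applies to x.
        theta_x = {k: model[k] for k in keys_for(x)}
        # Evaluate g(x | Theta(x)) and group its partial updates by entry id.
        for k, delta in g(x, theta_x).items():
            partials[k].append(delta)
    # "Co-group": aggregate the partial updates and apply the full update.
    updated = dict(model)
    for k, deltas in partials.items():
        updated[k] = combine(model[k], deltas)
    return updated


# Toy usage (assumed data): each record is (group id, observed value) and the
# model stores one running estimate per group.
data = [("a", 1.0), ("a", 3.0), ("b", 10.0)]
model = {"a": 0.0, "b": 0.0}

new_model = join_and_co_group(
    data,
    model,
    keys_for=lambda x: [x[0]],
    g=lambda x, theta: {x[0]: x[1] - theta[x[0]]},  # residual as partial update
    combine=lambda old, deltas: old + 0.5 * sum(deltas) / len(deltas),
)
print(new_model)
```

In this sketch Θ(x) is looked up per record, so nothing is materialized. A distributed plan that instead materializes every (x, Θ(x)) pair is where the blow-up mentioned in the abstract can arise, which is why the choice of physical plan matters.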
