Conference: International Conference on Very Large Data Bases

Are Key-Foreign Key Joins Safe to Avoid when Learning High-Capacity Classifiers?

Abstract

Machine learning (ML) over relational data is a booming area of data management. While there is a lot of work on scalable and fast ML systems, little work has addressed the pains of sourcing data for ML tasks. Real-world relational databases typically have many tables (often dozens), and data scientists often struggle to even obtain all the tables needed for joins before ML. In this context, Kumar et al. recently showed that key-foreign key dependencies (KFKDs) between tables often let us avoid such joins without significantly affecting prediction accuracy, an idea they called "avoiding joins safely." While initially controversial, this idea has since been used by multiple companies to reduce the burden of data sourcing for ML. But their work applied only to linear classifiers. In this work, we examine whether their results hold for three popular high-capacity classifiers: decision trees, non-linear SVMs, and ANNs. We conduct an extensive experimental study using both real-world datasets and simulations to analyze the effects of avoiding KFK joins on such models. Our results show that these high-capacity classifiers are, surprisingly and counter-intuitively, more robust to avoiding KFK joins than linear classifiers, refuting an intuition from the prior work's analysis. We explain this behavior intuitively and identify open questions at the intersection of data management and ML theoretical research.
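The reason a KFK join can be skipped at all is that the foreign key functionally determines every feature in the referenced table, so any labeling that depends on those features can also be written as a function of the foreign key alone. The sketch below illustrates only that dependency and is not the paper's experimental setup: the tables `R` and `S`, the `region` feature, the 10% label noise, and the lookup-table "classifier" are all hypothetical choices made for illustration. It measures training accuracy only; the generalization question the abstract raises (how a large foreign-key domain affects held-out accuracy) is not tested here.

```python
import random
from collections import Counter, defaultdict


def majority_classifier(rows, key_fn):
    """Fit a lookup table: for each key value, predict its most common label."""
    votes = defaultdict(Counter)
    for row in rows:
        votes[key_fn(row)][row["y"]] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}


def accuracy(model, rows, key_fn):
    """Fraction of rows whose label matches the lookup-table prediction."""
    return sum(model[key_fn(r)] == r["y"] for r in rows) / len(rows)


rng = random.Random(0)

# Hypothetical dimension table R: primary key -> one foreign feature.
R = {rid: {"region": rng.choice(["north", "south", "east"])} for rid in range(20)}

# Hypothetical fact table S: each row carries a foreign key into R and a
# label that depends (with 10% noise) on the foreign feature.
S = []
for _ in range(2000):
    fk = rng.randrange(20)
    signal = R[fk]["region"] == "north"
    y = signal if rng.random() < 0.9 else not signal
    S.append({"fk": fk, "y": y})


def by_fk(row):
    """Join avoided: use the foreign key itself as the only feature."""
    return row["fk"]


def by_joined(row):
    """Join performed: use the feature looked up from R."""
    return R[row["fk"]]["region"]


acc_no_join = accuracy(majority_classifier(S, by_fk), S, by_fk)
acc_join = accuracy(majority_classifier(S, by_joined), S, by_joined)

print(f"training accuracy, join avoided:   {acc_no_join:.3f}")
print(f"training accuracy, join performed: {acc_join:.3f}")
```

Because grouping by `fk` is a refinement of grouping by `region`, the foreign-key-only table can never fit the training rows worse than the joined one. Whether that holds on unseen data depends on how many training rows each foreign-key value receives, which is the kind of effect the study examines for decision trees, SVMs, and ANNs.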