Conference: IEEE International Congress on Big Data

Compile-Time Code Generation for Embedded Data-Intensive Query Languages


Abstract

Many emerging Big Data programming environments, such as Spark and Flink, provide powerful APIs that are inspired by functional programming. However, because of the complexity involved in developing and fine-tuning data analysis applications using the provided APIs, many programmers prefer to use declarative languages, such as Hive and Spark SQL, to code their distributed applications. Unfortunately, current data analysis query languages, which are typically based on the relational model, cannot effectively capture the rich data types and computations required for complex data analysis applications. Furthermore, these query languages are not well-integrated with the host programming language, as they are based on an incompatible data model, and are checked for correctness at run-time, which results in a significantly longer program development time. To address these shortcomings, we introduce a new query language for data-intensive scalable computing, called DIQL, that is deeply embedded in Scala, and a query optimization framework that optimizes and translates DIQL queries to byte code at compile-time. In contrast to other query languages, our query embedding eliminates impedance mismatch as any Scala code can be seamlessly mixed with SQL-like syntax, without having to add any special declaration. DIQL supports nested collections and hierarchical data and allows query nesting at any place in a query. With DIQL, programmers can express complex data analysis tasks, such as PageRank and matrix factorization, using SQL-like syntax exclusively. The DIQL query optimizer can find any possible join in a query, including joins hidden across deeply nested queries, thus unnesting any form of query nesting. Currently, DIQL can run on three Big Data platforms: Apache Spark, Apache Flink, and Twitter's Cascading/Scalding.
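To make the idea of query unnesting concrete, the following is a minimal sketch in plain Scala collections. It is not DIQL syntax and not the paper's optimization algorithm. The `Customer` and `Order` types, the sample data, and the function names are hypothetical and introduced only for illustration. The sketch shows a correlated nested query and an equivalent form in which the hidden join has been made explicit.

```scala
object UnnestingSketch {
  // Hypothetical example types; they do not come from the paper.
  case class Customer(id: Int, name: String)
  case class Order(customerId: Int, price: Double)

  // Nested form: the inner query over `orders` is re-evaluated for every
  // customer, so the cost grows with |customers| * |orders|.
  def nested(customers: Seq[Customer], orders: Seq[Order]): Seq[(String, Double)] =
    customers.map { c =>
      (c.name, orders.filter(_.customerId == c.id).map(_.price).sum)
    }

  // Unnested form: group the orders by the join key once, then probe the
  // resulting map per customer (a left outer join followed by aggregation).
  def unnested(customers: Seq[Customer], orders: Seq[Order]): Seq[(String, Double)] = {
    val totalByCustomer: Map[Int, Double] =
      orders.groupBy(_.customerId).map { case (id, os) => (id, os.map(_.price).sum) }
    // Customers without orders get 0.0, matching the sum of an empty inner query.
    customers.map(c => (c.name, totalByCustomer.getOrElse(c.id, 0.0)))
  }

  def main(args: Array[String]): Unit = {
    val customers = Seq(Customer(1, "Ann"), Customer(2, "Bob"), Customer(3, "Cy"))
    val orders = Seq(Order(1, 10.0), Order(2, 5.0), Order(1, 2.5))
    println(nested(customers, orders))
    println(unnested(customers, orders))
  }
}
```

Both functions return the same result. The second one reads `orders` once instead of once per customer, which is the kind of rewrite that matters when the collections are distributed datasets rather than in-memory sequences.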
