Source: SIGMOD Record

Implicit Parallelism through Deep Language Embedding

Abstract

Parallel collection processing based on second-order functions such as map and reduce has been widely adopted for scalable data analysis. Initially popularized by Google, this programming paradigm has over the past decade found its way into the core APIs of parallel dataflow engines such as Hadoop's MapReduce, Spark's RDDs, and Flink's DataSets. We review programming patterns typical of these APIs and discuss how they relate to the underlying parallel execution model. We argue that fixing the abstraction leaks exposed by these patterns will reduce the cost of data analysis by improving programmer productivity. To achieve that, we first revisit the algebraic foundations of parallel collection processing. Building on these foundations, we propose a simplified API that (i) provides proper support for nested collection processing and (ii) alleviates the need for certain second-order primitives through comprehensions, a declarative syntax akin to SQL. Finally, we present a metaprogramming pipeline that performs algebraic rewrites and physical optimizations, which allows us to target parallel dataflow engines like Spark and Flink with competitive performance.
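
To make the contrast between explicit second-order functions and comprehensions concrete, here is a minimal sketch in plain Python. It is not the API proposed in the paper and does not run on Spark or Flink. The data, the `flat_map` helper, and all variable names are hypothetical and chosen only for illustration. The same equi-join is written once with nested `map`/`filter`/`flat_map` calls and once as a comprehension, followed by a nested aggregation and a fold.

```python
from functools import reduce
from itertools import chain

# Hypothetical toy data (not from the paper): (customer_id, name) and (customer_id, amount).
customers = [(1, "Ada"), (2, "Linus"), (3, "Grace")]
orders = [(1, 30.0), (1, 12.5), (3, 99.0)]


def flat_map(f, xs):
    """Apply f to each element and concatenate the resulting collections."""
    return list(chain.from_iterable(map(f, xs)))


# Style 1: the join spelled out with explicit second-order functions.
join_explicit = flat_map(
    lambda c: list(
        map(
            lambda o: (c[1], o[1]),
            filter(lambda o: o[0] == c[0], orders),
        )
    ),
    customers,
)

# Style 2: the same join as a comprehension, which reads much like
# SELECT name, amount FROM customers, orders WHERE cid = order_cid.
join_comprehension = [
    (name, amount)
    for (cid, name) in customers
    for (order_cid, amount) in orders
    if cid == order_cid
]

# Nested collection processing: an aggregate over an inner collection
# is computed for every element of the outer one.
per_customer = [
    (name, sum(amount for (order_cid, amount) in orders if order_cid == cid))
    for (cid, name) in customers
]

# A fold (reduce) over the join result.
total = reduce(lambda acc, pair: acc + pair[1], join_comprehension, 0.0)

print(join_explicit == join_comprehension)
print(per_customer)
print(total)
```

Both styles produce the same pairs. The comprehension states what is joined and leaves out how the nesting of `flat_map`, `map`, and `filter` is arranged, which is the kind of declarative form that a rewrite-based optimizer can analyze.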