Venue: International Conference on Very Large Data Bases

Execution Primitives for Scalable Joins and Aggregations in Map Reduce

Abstract

Analytics on Big Data is critical for deriving business insights and driving innovation at today's Internet companies. Such analytics involve complex computations on large datasets and are typically performed on MapReduce-based frameworks such as Hive and Pig. However, in our experience, these systems are still quite limited in performing at scale. In particular, calculations that involve complex joins and aggregations, e.g., statistical calculations, scale poorly on these systems. In this paper we propose novel primitives for scaling such calculations. We propose a new data model that organizes datasets into calculation data units according to user-defined cost functions. We propose new operators that take advantage of these organized data units to significantly speed up joins and aggregations. Finally, we propose strategies for dividing the aggregation load uniformly across worker processes; these strategies are very effective in avoiding skew and in reducing (or in some cases even removing) the associated overheads. We have implemented all our proposed primitives in a framework called Rubix, which has been in production at LinkedIn for nearly a year. Rubix powers several applications and processes terabytes of data each day. We have seen remarkable improvements in the speed and cost of complex calculations due to these primitives.
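The abstract gives no implementation details, so the following is only a rough illustration of the two ideas it names: grouping rows into data units bounded by a user-defined cost function, and spreading those units evenly across workers. It is not Rubix code. The function names (`build_blocks`, `assign_blocks`), the greedy packing rule, and the least-loaded-worker assignment are all assumptions chosen for this sketch.

```python
import heapq
from collections import defaultdict


def build_blocks(rows, key, cost, budget):
    """Pack rows into blocks (hypothetical 'data units').

    All rows sharing a key stay in the same block, and a block is closed
    as soon as adding the next key group would push its cost past `budget`.
    `cost` is the user-defined cost function; it takes a list of rows.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    blocks, current, current_cost = [], [], 0
    for k in sorted(groups):  # assumes keys are orderable
        group_cost = cost(groups[k])
        if current and current_cost + group_cost > budget:
            blocks.append(current)
            current, current_cost = [], 0
        current.extend(groups[k])
        current_cost += group_cost
    if current:
        blocks.append(current)
    return blocks


def assign_blocks(blocks, cost, workers):
    """Give each block to the currently least-loaded worker, costliest
    blocks first (a standard greedy heuristic, assumed here for illustration)."""
    heap = [(0, w) for w in range(workers)]
    assignment = [[] for _ in range(workers)]
    for block in sorted(blocks, key=cost, reverse=True):
        load, w = heapq.heappop(heap)
        assignment[w].append(block)
        heapq.heappush(heap, (load + cost(block), w))
    return assignment
```

With `cost=len`, the budget is simply a row count; a cost function that estimates memory or join fan-out could be passed instead. Because each key is confined to one block, a join or aggregation on that key could in principle run per block without shuffling its rows again, which is the kind of saving the abstract alludes to.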