Conference: IEEE International Conference on Smart Data Services

High Performance Data Engineering Everywhere

Abstract

The amazing advances being made in the fields of machine and deep learning are a highlight of the Big Data era for both enterprise and research communities. Modern applications require resources beyond a single node's ability to provide. However, this is just a small part of the issues facing the overall data processing environment, which must also support a raft of data engineering tasks for pre- and post-processing, communication, and system integration. An important requirement of data analytics tools is the ability to integrate easily with existing frameworks in a multitude of languages, thereby increasing user productivity and efficiency. All this demands an efficient, highly distributed, and integrated approach to data processing, yet many of today's popular data analytics tools are unable to satisfy all these requirements at the same time. In this paper we present Cylon, an open-source high performance distributed data processing library that can be seamlessly integrated with existing Big Data and AI/ML frameworks. It is developed with a flexible C++ core on top of a compact data structure and exposes language bindings to C++, Java, and Python. We discuss Cylon's architecture in detail and show how it can be imported as a library into existing applications or operated as a standalone framework. Initial experiments show that Cylon enhances popular tools such as Apache Spark and Dask with major performance improvements for key operations and better component linkages. Finally, we show how its design enables Cylon to be used cross-platform with minimal overhead, including with popular AI tools such as PyTorch, TensorFlow, and Jupyter notebooks.
