High Performance Data Engineering Everywhere

Abstract

The amazing advances being made in the fields of machine and deep learning are a highlight of the Big Data era for both enterprise and research communities. Modern applications require resources beyond a single node's ability to provide. However, this is just a small part of the issues facing the overall data processing environment, which must also support a raft of data engineering for pre- and post-data processing, communication, and system integration. An important requirement of data analytics tools is to be able to easily integrate with existing frameworks in a multitude of languages, thereby increasing user productivity and efficiency. All this demands an efficient and highly distributed integrated approach for data processing, yet many of today's popular data analytics tools are unable to satisfy all these requirements at the same time. In this paper we present Cylon, an open-source high performance distributed data processing library that can be seamlessly integrated with existing Big Data and AI/ML frameworks. It is developed with a flexible C++ core on top of a compact data structure and exposes language bindings to C++, Java, and Python. We discuss Cylon's architecture in detail, and reveal how it can be imported as a library to existing applications or operate as a standalone framework. Initial experiments show that Cylon enhances popular tools such as Apache Spark and Dask with major performance improvements for key operations and better component linkages. Finally, we show how its design enables Cylon to be used cross-platform with minimum overhead, which includes popular AI tools such as PyTorch, TensorFlow, and Jupyter notebooks.

Bibliographic details

Source
IEEE International Conference on Smart Data Services, 2020, pp. 122-132 (11 pages)
Authors
Chathura Widanage; Niranda Perera; Vibhatha Abeykoon; Supun Kamburugamuve; Thejaka Amila Kanewala; Hasara Maithree; Pulasthi Wickramasinghe; Ahmet Uyar; Gurhan Gunduz; Geoffrey Fox
Original format: PDF
Keywords
Data analysis; C++ languages; System integration; Tools; Big Data; Data engineering; Libraries

Similar literature

1. Prediction of Students’ Performances Using Course Analytics Data: A Case of Water Engineering Course at the University of South Australia [J]. Faisal Ahammed, Elizabeth Smith. Education Sciences, 2019, No. 3.
2. Learning analytics for smart campus: Data on academic performances of engineering undergraduates in Nigerian private university [J]. Segun I. Popoola, Aderemi A. Atayero, Joke A. Badejo. Data in Brief, 2018, No. 1.
3. Data mining to increase teaching performance in engineering education [J]. Dominik Strzalka. Computing reviews, 2021, No. 1.
4. Performance engineering for EA systems in next generation data centres [C]. Jerome Rolia, Ludmila Cherkasova, Richard Friedrich. International workshop on Software and performance, 2007.
5. Towards Data Analytics-Aware High Performance Data Engineering and Benchmarking [D]. Abeykoon, Vibhatha Lakmal. 2021.
6. Learning analytics for smart campus: Data on academic performances of engineering undergraduates in Nigerian private university [O]. Segun I. Popoola, Aderemi A. Atayero, Joke A. Badejo. 2018.
7. Reengineering human performance and fatigue research through use of physiological monitoring devices, web-based and mobile device data collection methods, and integrated data storage techniques [O]. Patillo Paul L., OConnor Maureen J. 2003.
