Indexing for Large Scale Data Querying Based on Spark SQL

Abstract

Spark SQL lets Spark programmers query structured data inside Spark programs using SQL statements. It makes the benefits of relational processing convenient to use, and its underlying distributed RDD processing accelerates queries on large data sets. However, Spark SQL is not designed for long-running services: its built-in data source reloads data from the storage system, such as HDFS or the local file system, on every table scan, with no caching mechanism. Users can keep data in memory explicitly with the "cache" command, but the cached data is coarse-grained. In this paper, we present an indexing structure that is a pluggable component of Spark SQL, built on Apache Spark. Compared with Spark SQL alone, it offers two additional advantages. First, it allows users to create an index on the structured data to be processed, which greatly speeds up queries. Second, it enables programmers to load structured data into memory at the granularity of individual data files, which makes it flexible to load "hot data" into memory and to evict "cold data" from it.
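The two ideas in the abstract, an index that avoids unnecessary scans and file-level caching of "hot" data, can be illustrated with a toy model. The sketch below is not the paper's implementation and does not use any Spark API. It is plain Python under stated assumptions: storage is simulated by an in-memory dictionary, the index is a per-file min/max key range plus binary search, and eviction is least-recently-used. The class `IndexedFileStore` and all of its members are hypothetical names chosen for illustration.

```python
from bisect import bisect_left, bisect_right
from collections import OrderedDict


class IndexedFileStore:
    """Toy model of a table split into data files with a key index and an LRU file cache."""

    def __init__(self, files, cache_capacity):
        # files: dict mapping file name -> list of (key, row) tuples.
        # Rows are kept sorted by key so a lookup can use binary search.
        self._storage = {
            name: sorted(rows, key=lambda kv: kv[0]) for name, rows in files.items()
        }
        # Per-file (min key, max key): lets a query skip files that cannot match.
        self._ranges = {
            name: (rows[0][0], rows[-1][0])
            for name, rows in self._storage.items()
            if rows
        }
        self._cache = OrderedDict()  # file name -> rows, least recently used first
        self._capacity = cache_capacity
        self.loads = 0  # number of simulated reads from the storage system

    def _load(self, name):
        """Return the rows of one data file, reading from storage only on a cache miss."""
        if name in self._cache:
            self._cache.move_to_end(name)  # mark as recently used ("hot")
            return self._cache[name]
        self.loads += 1
        rows = self._storage[name]
        self._cache[name] = rows
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict the coldest file
        return rows

    def lookup(self, key):
        """Return all rows whose key equals `key`, touching only candidate files."""
        result = []
        for name, (low, high) in self._ranges.items():
            if low <= key <= high:
                rows = self._load(name)
                keys = [k for k, _ in rows]
                start, end = bisect_left(keys, key), bisect_right(keys, key)
                result.extend(row for _, row in rows[start:end])
        return result

    def cached_files(self):
        """Names of the files currently held in memory, coldest first."""
        return list(self._cache)


store = IndexedFileStore(
    {
        "part-0": [(1, "a"), (3, "b")],
        "part-1": [(5, "c"), (7, "d")],
        "part-2": [(9, "e")],
    },
    cache_capacity=2,
)
first = store.lookup(5)   # reads part-1 from storage
second = store.lookup(5)  # served from the cache, no new read
store.lookup(1)           # reads part-0
store.lookup(9)           # reads part-2 and evicts part-1, the coldest file
```

In this sketch each lookup reads at most the files whose key range can contain the key, and only individual files enter or leave memory. That contrasts with caching a whole table at once, which is the coarse granularity the abstract attributes to the "cache" command.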


Bibliographic details

Source
IEEE International Conference on E-Business Engineering | 2017 | pp. 103-108 | 6 pages
Authors
Yi Cui; Guoqiang Li; Hao Cheng; Daoyuan Wang
Original format
PDF
Keywords
Sparks; Big Data; Optimization; Indexing; Structured Query Language; Acceleration
Date added
2022-08-26 13:48:42
