Venue: IEEE International Conference on E-Business Engineering

Indexing for Large Scale Data Querying Based on Spark SQL


Abstract

Spark SQL lets Spark programmers query structured data inside Spark programs using SQL statements. It makes the benefits of relational processing readily available to them, and its internal RDD-based distributed processing accelerates queries on large data sets. However, Spark SQL is not designed for long-running services: its built-in data sources have no cache mechanism and reload data from the storage system, such as HDFS or the local file system, on every table scan. Users can keep data in memory explicitly with the "cache" command, but the data cached this way is coarse-grained. In this paper, we present an indexing structure that is a pluggable component of Spark SQL, built on Apache Spark. Compared with plain Spark SQL, it offers two additional advantages. First, it allows users to create an index on the structured data to be processed, which greatly speeds up queries. Second, it enables programmers to load structured data into memory at the granularity of individual data files, which makes it flexible to load "hot data" into memory and to evict "cold data" from memory.
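The two ideas named in the abstract, an index over structured data and file-level loading and eviction, can be illustrated with a small plain-Python sketch. It is not the paper's implementation and does not use Spark. The class `IndexedFileStore`, its methods, and the sample file names are all hypothetical, chosen only to show how an index can limit a query to the files that hold matching rows while individual "hot" files stay in memory.

```python
from collections import defaultdict


class IndexedFileStore:
    """Toy model of per-file indexing and caching (hypothetical, not the paper's design)."""

    def __init__(self, files):
        # files: {file_name: [row_dict, ...]} stands in for HDFS or local storage.
        self.files = files
        self.index = defaultdict(set)  # column value -> names of files containing it
        self.indexed_column = None
        self.cache = {}                # file_name -> rows currently held in memory
        self.storage_reads = 0         # number of file reads that went to "storage"

    def create_index(self, column):
        """Record, for each value of `column`, which files contain it."""
        self.index.clear()
        self.indexed_column = column
        for name, rows in self.files.items():
            for row in rows:
                self.index[row[column]].add(name)

    def load(self, name):
        """Bring a single file ("hot data") into memory."""
        if name not in self.cache:
            self.storage_reads += 1
            self.cache[name] = self.files[name]

    def evict(self, name):
        """Drop a single file ("cold data") from memory."""
        self.cache.pop(name, None)

    def _read(self, name):
        # Serve from memory when possible, otherwise count a storage read.
        if name in self.cache:
            return self.cache[name]
        self.storage_reads += 1
        return self.files[name]

    def lookup(self, value):
        """Return rows whose indexed column equals `value`, touching only indexed files."""
        names = sorted(self.index.get(value, ()))
        return [
            row
            for name in names
            for row in self._read(name)
            if row[self.indexed_column] == value
        ]


files = {
    "part-0": [{"id": 1, "city": "Paris"}, {"id": 2, "city": "Rome"}],
    "part-1": [{"id": 3, "city": "Oslo"}],
    "part-2": [{"id": 4, "city": "Rome"}],
}
store = IndexedFileStore(files)
store.create_index("city")
rome_rows = store.lookup("Rome")  # reads part-0 and part-2, skips part-1
store.load("part-0")              # keep one hot file in memory
paris_rows = store.lookup("Paris")  # served from memory, no storage read
```

In this sketch the unit of caching is one file rather than a whole table, which is the contrast the abstract draws with the coarse-grained "cache" command.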
