
METHOD FOR ESTABLISHING INDEX ON HDFS-BASED SPARK-SQL BIG-DATA PROCESSING SYSTEM


Abstract

Provided is a method for establishing an index on an HDFS-based Spark-SQL big-data processing system. Using SQL statements, indexes can be added to and deleted from the system, and data can be inserted into and deleted from it. When data is queried, the system automatically determines whether the query column has an index; if so, it looks up the file blocks listed in the index and filters out the file blocks that do not need to be searched. Adding index functionality to Spark-SQL can effectively increase query speed. For example, consider a typical Spark-SQL data table of 1000 GB stored as 1000 files of 1 GB each. Querying a single record without an index requires scanning all 1000 files; once the index is established, scanning one file suffices, so the number of files scanned falls by a factor of 1000. Under typical circumstances, and judging from experience with conventional relational databases, a Spark-SQL query on an indexed column runs 100 to 10,000 times faster, or more, than the same query without an index.
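The file-pruning idea in the abstract can be illustrated with a minimal in-memory sketch. This is not the patented implementation and does not use Spark or HDFS: the names `build_index` and `query`, the `part-NNNN` file names, and the one-record-per-file layout are all hypothetical, chosen only to mirror the 1000-file example. The index maps each value of a column to the set of files that contain it, and a query consults it to decide which files to scan.

```python
from collections import defaultdict


def build_index(files, column):
    """Map each value of `column` to the set of file names containing it."""
    index = defaultdict(set)
    for name, rows in files.items():
        for row in rows:
            index[row[column]].add(name)
    return index


def query(files, indexes, column, value):
    """Return (matching rows, names of the files that were scanned)."""
    if column in indexes:
        # Indexed column: scan only the files the index lists for this value.
        to_scan = sorted(indexes[column].get(value, set()))
    else:
        # No index on this column: fall back to scanning every file.
        to_scan = sorted(files)
    hits = [row for name in to_scan for row in files[name] if row[column] == value]
    return hits, to_scan


# Toy stand-in for a table split into 1000 files (one record per file here).
files = {f"part-{i:04d}": [{"id": i, "name": f"user{i}"}] for i in range(1000)}
indexes = {"id": build_index(files, "id")}

hits, scanned = query(files, indexes, "id", 42)       # uses the index
hits_full, scanned_full = query(files, {}, "id", 42)  # no index available

print(len(scanned), len(scanned_full))
```

Both calls return the same record, but the indexed lookup touches a single file while the unindexed one touches all 1000, which is the reduction the abstract describes.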
