Conference: IEEE International Congress on Big Data

QDrill: Query-Based Distributed Consumable Analytics for Big Data

Abstract

Consumable analytics attempt to address the shortage of skilled data analysts in many organizations by offering analytic functionality in a form more familiar to in-house experts. Providing consumable analytics for Big Data faces three main challenges. The first is making the analytics algorithms run in a distributed fashion so that Big Data can be analyzed in a timely manner. The second is providing an easy interface that lets in-house experts run these algorithms in a distributed fashion while minimizing the learning curve and rewrites of existing code. The third is running the analytics on data of different formats stored in heterogeneous data stores. In this paper, we address these challenges with the proposed QDrill. We introduce the Analytics Adaptor extension for Apache Drill, a schema-free SQL query engine for non-relational storage. The Analytics Adaptor introduces the Distributed Analytics Query Language for invoking data mining algorithms from within standard Drill SQL query statements. The adaptor allows any sequential single-node data mining library (e.g., WEKA) to be used and makes its algorithms run in a distributed fashion without rewriting them. We evaluate QDrill against Apache Mahout. The evaluation shows that QDrill outperforms Mahout in updatable model training and in the scoring phase while keeping almost the same performance for non-updatable model training. QDrill is more scalable and offers an easier interface, no storage overhead, and the whole WEKA algorithm repository, with the ability to extend to algorithms from other data mining libraries.
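The abstract's central idea, running an unmodified single-node algorithm in a distributed fashion, can be illustrated with a minimal sketch. It assumes one common strategy: train an independent model on each data partition with the sequential learner, then combine the partial models by majority vote at scoring time. The abstract does not say that QDrill works this way. The toy learner and every function name below are hypothetical stand-ins; nothing here is QDrill's, Drill's, or WEKA's actual API.

```python
from collections import Counter, defaultdict


def train_sequential(rows):
    """Hypothetical stand-in for a single-node learner.

    rows is a list of (feature_value, label) pairs; the "model" is the
    mean feature value observed for each label.
    """
    sums = defaultdict(float)
    counts = Counter()
    for x, label in rows:
        sums[label] += x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in counts}


def predict(model, x):
    """Return the label whose mean is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))


def train_distributed(partitions):
    """Train one model per non-empty partition, reusing the sequential learner as-is.

    In a real engine each call would run on a different worker; here the
    partitions are simply processed one after another.
    """
    return [train_sequential(p) for p in partitions if p]


def score(models, x):
    """Combine the partial models by majority vote."""
    votes = Counter(predict(m, x) for m in models)
    return votes.most_common(1)[0][0]


# Toy data split across three hypothetical workers.
partitions = [
    [(1.0, "low"), (9.0, "high")],
    [(2.0, "low"), (8.0, "high")],
    [(1.5, "low"), (10.0, "high")],
]
models = train_distributed(partitions)
```

Because `train_sequential` is never modified, any other single-node learner with the same train and predict shape could be swapped in, which is the property the abstract emphasizes.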