
Optimizing RDF Analytical Queries on MapReduce.


Abstract

The broadened use of Semantic Web technologies to enable data integration solutions across domains has increased the amount of semi-structured data on the Web represented using the Resource Description Framework (RDF). This has led to growing interest in supporting analytics over semantically integrated RDF data warehouses, such as analysis of patient and drug profile data in life sciences for more effective clinical trial recruiting, and analysis of people-places data by e-government decision makers to improve citizen facilitation. To support large-scale RDF analytics, it is crucial to investigate how to leverage parallel processing systems such as MapReduce and extended systems such as Apache Hadoop, Pig, and Hive.

An RDF analytical query consists of (i) graph pattern matching to compute query-relevant subgraphs, (ii) grouping on desired attributes, and (iii) aggregating values. In general, graph pattern matching translates to several join operations due to the fine-grained nature of the RDF data model. Many analytical queries involve multiple groupings and aggregations over slightly different graph patterns, further increasing the number of join operations. Evaluating such join-intensive workloads on existing platforms results in lengthy execution workflows. The challenge with such lengthy workflows is the I/O and network transfer cost of the intermediate data produced across multiple map-reduce phases. These costs can be significant for workloads that produce large intermediate results. Consequently, it is critical to develop techniques that enable more nimble execution of such workflows.

In this dissertation, we present a holistic approach to minimize the I/O and network transfer costs while processing RDF analytical queries on MapReduce. Given that RDF analytical queries often involve repeated computations over slightly different graph patterns, query plans that enable shared execution of common subpatterns are likely to compile into efficient MapReduce execution plans. To achieve this, we propose the following three optimization techniques that exploit sharing opportunities while evaluating RDF analytical queries.

  • First, we propose an algebraic optimization approach that enables shared execution of overlapping graph patterns, thus eliminating the multiple phases of I/O and materialization involved in evaluating multiple graph patterns in an RDF analytical query. The grouping and aggregation definitions are decoupled to enable sharing of scans and computations across the graph pattern matching phases as well as the grouping-aggregation phases. Such a rewriting results in an aggressive reduction in the length of execution workflows.
  • Second, we propose an algebraic optimization of basic graph pattern queries using an alternative data model and algebra called the Nested TripleGroup Data Model and Algebra (NTGA). The NTGA query plans enable concurrent computation of star-shaped join subpatterns in a query, as opposed to existing systems that require a separate map-reduce cycle for each star subpattern. Thus, by enabling sharing of scans and computations across multiple star subpatterns, the NTGA query plans require fewer map-reduce cycles.
  • Third, we propose strategies for efficient management of intermediate results while evaluating graph pattern queries with multi-valued and unbound properties. For such queries, normalized representations of intermediate results using relational join operations introduce redundancies in results. To mitigate the effects of such redundancies, the NTGA's nested data model and lazy unnesting strategies enable sharing of data references, scans, and computations, thus reducing the footprint of intermediate results.

We extended Apache Pig's computational infrastructure to integrate the NTGA-based data model and operators along with the optimization strategies. Empirical evaluation on real-world and synthetic benchmark datasets demonstrates that the NTGA-based query plans result in shortened execution workflows with a reduced footprint of intermediate results, thus minimizing the I/O and network transfers while processing RDF analytical queries on MapReduce.
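
The three steps of an RDF analytical query named in the abstract (graph pattern matching, grouping, aggregation) and the idea of matching several star-shaped subpatterns from a single scan can be illustrated with a small in-memory sketch. This is not the dissertation's NTGA implementation and does not use Apache Pig or MapReduce; the triples, property names (`takes`, `age`, `name`, `treats`), and function names below are all hypothetical, chosen only to echo the patient and drug example mentioned above.

```python
from collections import defaultdict

# Hypothetical toy data: (subject, property, object) triples.
TRIPLES = [
    ("drug1", "name", "Aspirin"),
    ("drug1", "treats", "pain"),
    ("drug2", "name", "Ibuprofen"),
    ("drug2", "treats", "pain"),
    ("drug3", "name", "Metformin"),
    ("drug3", "treats", "diabetes"),
    ("p1", "takes", "drug1"),
    ("p1", "age", 40),
    ("p2", "takes", "drug2"),
    ("p2", "age", 60),
    ("p3", "takes", "drug3"),
    ("p3", "age", 50),
    ("p4", "takes", "drug1"),
    ("p4", "age", 30),
]

# Two star subpatterns, each defined by the properties its subject must have.
STARS = {"patient": {"takes", "age"}, "drug": {"name", "treats"}}


def build_triplegroups(triples):
    """Single scan: collect all triples sharing a subject into one nested group."""
    groups = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        groups[s][p].append(o)  # multi-valued properties stay nested
    return groups


def match_stars(groups, stars):
    """Check every group against all star subpatterns in the same pass."""
    matches = {name: {} for name in stars}
    for subject, props in groups.items():
        for name, required in stars.items():
            if required.issubset(props):
                matches[name][subject] = props
    return matches


def average_age_by_condition(matches):
    """Join patient stars to drug stars, group by condition, average the ages."""
    ages = defaultdict(list)
    for props in matches["patient"].values():
        for drug in props["takes"]:
            drug_props = matches["drug"].get(drug)
            if drug_props is None:
                continue  # patient takes a drug with no matching drug star
            for condition in drug_props["treats"]:
                ages[condition].extend(props["age"])
    return {c: sum(v) / len(v) for c, v in ages.items()}


groups = build_triplegroups(TRIPLES)
matches = match_stars(groups, STARS)
result = average_age_by_condition(matches)
print(result)
```

In this sketch both star subpatterns are recognized from one grouped pass over the triples, rather than one pass per star, and multi-valued properties remain nested lists until the join needs them. That loosely mirrors the sharing and lazy unnesting ideas described in the abstract, under the simplifying assumption that everything fits in memory on one machine.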

Bibliographic details

  • Author

    Ravindra, Padmashree

  • Author affiliation

    North Carolina State University

  • Degree-granting institution: North Carolina State University
  • Subject: Computer science
  • Degree: Ph.D.
  • Year: 2014
  • Pages: 216 p.
  • Total pages: 216
  • Original format: PDF
  • Language of text: English
