Conference: IEEE World Congress on Services

Storing, Indexing and Querying Large Provenance Data Sets as RDF Graphs in Apache HBase

Abstract

Provenance, which records the history of an in-silico experiment, has been identified as an important requirement for scientific workflows: it supports reproducibility of scientific discoveries, result interpretation, and problem diagnosis. Large provenance datasets are composed of many smaller provenance graphs, each of which corresponds to a single workflow execution. In this work, we explore and address the challenge of efficient and scalable storage and querying of large collections of provenance graphs serialized as RDF graphs in an Apache HBase database. Specifically, we propose: (i) novel storage and indexing techniques for RDF data in HBase that are better suited to provenance datasets than to generic RDF graphs and (ii) novel SPARQL query evaluation algorithms that rely solely on indices to compute expensive join operations, use numeric values that represent triple positions rather than actual triples, and eliminate the need for intermediate data transfers over a network. The empirical evaluation of our algorithms using provenance datasets and queries of the University of Texas Provenance Benchmark confirms that our approach is efficient and scalable.
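The abstract does not spell out the index layout or the join algorithms, so the following is only a rough illustration of the general idea of joining on indices that hold numeric triple positions. It is a minimal in-memory sketch in plain Python, not the paper's HBase schema or its actual algorithm. The toy graph, the property names `wasGeneratedBy` and `used`, and the helpers `build_indices` and `join_on_positions` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy provenance graph: (subject, predicate, object) triples.
# The list index of each triple serves as its numeric position.
TRIPLES = [
    ("data1", "wasGeneratedBy", "proc1"),  # position 0
    ("proc1", "used", "data0"),            # position 1
    ("data2", "wasGeneratedBy", "proc2"),  # position 2
    ("proc2", "used", "data1"),            # position 3
]


def build_indices(triples):
    """Build two assumed indices: predicate -> subject -> positions,
    and predicate -> object -> positions."""
    by_subject = defaultdict(lambda: defaultdict(list))
    by_object = defaultdict(lambda: defaultdict(list))
    for pos, (s, p, o) in enumerate(triples):
        by_subject[p][s].append(pos)
        by_object[p][o].append(pos)
    return by_subject, by_object


def join_on_positions(by_subject, by_object, pred_a, pred_b):
    """Evaluate the pattern { ?x pred_a ?j . ?j pred_b ?y }.

    Only the indices are consulted; the result is a sorted list of
    (position_a, position_b) pairs rather than materialized triples.
    """
    left = by_object.get(pred_a, {})    # join term occurs as an object
    right = by_subject.get(pred_b, {})  # join term occurs as a subject
    pairs = []
    for term in left.keys() & right.keys():
        for a in left[term]:
            for b in right[term]:
                pairs.append((a, b))
    return sorted(pairs)


by_s, by_o = build_indices(TRIPLES)
result = join_on_positions(by_s, by_o, "wasGeneratedBy", "used")
print(result)
```

In this sketch the join result stays as pairs of integers, and a caller would look up the actual triples only for the final answer. Whether the paper defers materialization in the same way is an assumption here.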
