Journal: IEEE Transactions on Knowledge and Data Engineering

Dynamic Data Exchange in Distributed RDF Stores



Abstract

When RDF datasets become too large to be managed by centralised systems, they are often distributed in a cluster of shared-nothing servers, and queries are answered using a distributed join algorithm. Although such solutions have been extensively studied in relational and RDF databases, we argue that existing approaches exhibit two drawbacks. First, they usually decide statically (i.e., at query compile time) how to shuffle the data, which can lead to missed opportunities for local computation. Second, they often materialise large intermediate relations whose size is determined by the entire dataset (and not the data stored in each server), so these relations can easily exceed the memory of individual servers. As a possible remedy, we present a novel distributed join algorithm for RDF. Our approach decides when to shuffle data dynamically, which ensures that query answers that can be wholly produced within a server involve only local computation. It also uses a novel flow control mechanism to ensure that every query can be answered even if each server has a bounded amount of memory that is much smaller than the intermediate relations. We complement our algorithm with a new query planning approach that balances the cost of communication against the cost of local processing at each server. Moreover, as in several existing approaches, we distribute RDF data using graph partitioning so as to maximise local computation, but we refine the partitioning algorithm to produce more balanced partitions. We show empirically that our techniques can outperform the state of the art by orders of magnitude in terms of query evaluation times, network communication, and memory use. In particular, bounding the memory use in individual servers can mean the difference between success and failure for answering queries with large answer sets.
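The idea of deciding at run time whether a partial answer must leave its server can be illustrated with a toy, single-process simulation. This is a minimal sketch under stated assumptions, not the algorithm from the paper: triples are placed by subject, the query is a fixed two-step path (?x p1 ?y . ?y p2 ?z), and the names `partition_by_subject`, `evaluate_path`, and `placement` are hypothetical. Flow control, query planning, and graph partitioning are not modelled.

```python
from collections import defaultdict


def partition_by_subject(triples, placement):
    """Store each triple on the server that owns its subject (assumed placement rule)."""
    stores = defaultdict(list)
    for s, p, o in triples:
        stores[placement[s]].append((s, p, o))
    return dict(stores)  # plain dict so lookups never mutate it


def evaluate_path(stores, placement, p1, p2):
    """Answer ?x p1 ?y . ?y p2 ?z.

    A partial answer (x, y) is shipped to another server only when the
    server owning y differs from the one that produced the partial answer.
    Returns the sorted answers and the number of simulated messages.
    """
    answers, messages = [], 0
    for server, local in stores.items():
        for x, p, y in local:
            if p != p1:
                continue
            target = placement.get(y)
            if target is None:
                continue  # y is never a subject, so the second atom cannot match
            if target != server:
                messages += 1  # decided per partial answer, at run time
            for s, q, z in stores.get(target, ()):
                if s == y and q == p2:
                    answers.append((x, y, z))
    return sorted(answers), messages


# Hypothetical data: a four-node cycle split across two servers.
triples = [("a", "knows", "b"), ("b", "knows", "c"),
           ("c", "knows", "d"), ("d", "knows", "a")]
placement = {"a": 0, "b": 0, "c": 1, "d": 1}

stores = partition_by_subject(triples, placement)
answers, messages = evaluate_path(stores, placement, "knows", "knows")
print(answers, messages)
```

In this toy run, the partial answers whose join value lives on the same server are completed without any message, while the remaining ones are forwarded individually. A plan fixed at compile time would have no way to make that distinction per partial answer.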
