Journal: IEEE Transactions on Knowledge and Data Engineering

Meshing Streaming Updates with Persistent Data in an Active Data Warehouse



Abstract

Active data warehousing has emerged as an alternative to conventional warehousing practices in order to meet the high demand of applications for up-to-date information. In a nutshell, an active warehouse is refreshed online and thus achieves a higher consistency between the stored information and the latest data updates. The need for online warehouse refreshment introduces several challenges in the implementation of data warehouse transformations, with respect to their execution time and their overhead to the warehouse processes. In this paper, we focus on a frequently encountered operation in this context, namely, the join of a fast stream S of source updates with a disk-based relation R, under the constraint of limited memory. This operation lies at the core of several common transformations such as surrogate key assignment, duplicate detection, or identification of newly inserted tuples. We propose a specialized join algorithm, termed mesh join (MESHJOIN), which compensates for the difference in the access cost of the two join inputs by 1) relying entirely on fast sequential scans of R and 2) sharing the I/O cost of accessing R across multiple tuples of S. We detail the MESHJOIN algorithm and develop a systematic cost model that enables the tuning of MESHJOIN for two objectives: maximizing throughput under a specific memory budget or minimizing memory consumption for a specific throughput. We present an experimental study that validates the performance of MESHJOIN on synthetic and real-life data. Our results verify the scalability of MESHJOIN to fast streams and large relations and demonstrate its numerous advantages over existing join algorithms.
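
The two ideas named in the abstract (purely sequential scans of R, and sharing each scan across many stream tuples) can be illustrated with a small in-memory sketch. This is not the authors' MESHJOIN implementation and it ignores the cost model entirely. The function name mesh_join_sketch, the parameter batch_size, the representation of R as a list of pages, and the assumption that the join key is the first field of every tuple are all hypothetical choices made for illustration.

```python
from collections import deque


def mesh_join_sketch(stream, relation_pages, batch_size):
    """Yield (stream_tuple, relation_tuple) pairs whose first fields match.

    Assumptions (illustrative only): relation_pages is a list of pages, each
    page a list of tuples; R is scanned cyclically, one page per step; each
    batch of stream tuples stays in memory for exactly one full cycle of R.
    """
    num_pages = len(relation_pages)
    if num_pages == 0:
        return  # an empty relation produces no join results

    window = deque()   # batches currently in memory, oldest first
    index = {}         # join key -> deque of in-memory stream tuples
    in_memory = 0      # number of stream tuples currently held
    stream_iter = iter(stream)
    exhausted = False
    page_no = 0

    while not exhausted or in_memory > 0:
        # Step 1: admit up to batch_size new stream tuples.
        batch = []
        while not exhausted and len(batch) < batch_size:
            try:
                batch.append(next(stream_iter))
            except StopIteration:
                exhausted = True
        window.append(batch)
        for s in batch:
            index.setdefault(s[0], deque()).append(s)
        in_memory += len(batch)

        # Step 2: read the next page of R sequentially and probe the index.
        # One page read is shared by every stream tuple held in memory.
        for r in relation_pages[page_no]:
            for s in index.get(r[0], ()):
                yield (s, r)
        page_no = (page_no + 1) % num_pages

        # Step 3: the oldest batch has now met every page of R; expire it.
        if len(window) == num_pages:
            for s in window.popleft():
                bucket = index[s[0]]
                bucket.popleft()  # tuples per key expire in arrival order
                if not bucket:
                    del index[s[0]]
                in_memory -= 1
```

In this sketch the memory held at any moment is bounded by roughly num_pages * batch_size stream tuples plus one page of R, which hints at the trade-off between memory and throughput that the abstract's cost model is said to tune; the actual relationship in the paper may differ.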
