Meshing Streaming Updates with Persistent Data in an Active Data Warehouse

Polyzotis N.; Skiadopoulos S.; Vassiliadis P.; Simitsis A.; Frantzell N.

Abstract

Active data warehousing has emerged as an alternative to conventional warehousing practices in order to meet the high demand of applications for up-to-date information. In a nutshell, an active warehouse is refreshed online and thus achieves a higher consistency between the stored information and the latest data updates. The need for online warehouse refreshment introduces several challenges in the implementation of data warehouse transformations, with respect to their execution time and their overhead to the warehouse processes. In this paper, we focus on a frequently encountered operation in this context, namely, the join of a fast stream S of source updates with a disk-based relation R, under the constraint of limited memory. This operation lies at the core of several common transformations such as surrogate key assignment, duplicate detection, or identification of newly inserted tuples. We propose a specialized join algorithm, termed mesh join (MESHJOIN), which compensates for the difference in the access cost of the two join inputs by 1) relying entirely on fast sequential scans of R and 2) sharing the I/O cost of accessing R across multiple tuples of S. We detail the MESHJOIN algorithm and develop a systematic cost model that enables the tuning of MESHJOIN for two objectives: maximizing throughput under a specific memory budget or minimizing memory consumption for a specific throughput. We present an experimental study that validates the performance of MESHJOIN on synthetic and real-life data. Our results verify the scalability of MESHJOIN to fast streams and large relations and demonstrate its numerous advantages over existing join algorithms.
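
The two ideas named in the abstract (sequential scans of R only, and one scan of R shared by many tuples of S) can be illustrated with a small in-memory sketch. This is not the authors' implementation: the function name `mesh_join_sketch`, the parameters `num_partitions` and `batch_size`, and the use of a Python list as a stand-in for the disk-resident relation are all assumptions made for illustration. The sketch keeps a sliding window of stream batches in a hash index, reads R cyclically one partition at a time, and expires a batch once it has been probed by every partition of R.

```python
from collections import defaultdict, deque
from itertools import islice
from operator import itemgetter


def mesh_join_sketch(stream, relation, num_partitions, batch_size,
                     key_s=itemgetter(0), key_r=itemgetter(0)):
    """Yield (s, r) pairs with key_s(s) == key_r(r).

    Illustrative assumptions: `relation` is a list standing in for the
    disk-based relation R, and `stream` is any iterable of tuples of S.
    """
    if not relation or num_partitions < 1 or batch_size < 1:
        return
    # Split R into contiguous chunks; reading them in order, cyclically,
    # models a repeated sequential scan of R.
    chunk = -(-len(relation) // min(num_partitions, len(relation)))
    partitions = [relation[i:i + chunk] for i in range(0, len(relation), chunk)]
    k = len(partitions)

    index = defaultdict(list)   # join key -> stream tuples currently in memory
    window = deque()            # batches of S, oldest first
    source = iter(stream)
    step = 0
    while True:
        batch = list(islice(source, batch_size))
        if not batch and not any(window):
            break               # stream exhausted and no tuple left in memory
        window.append(batch)
        for s in batch:
            index[key_s(s)].append(s)

        # One partition of R is read once and probed against every stream
        # tuple in memory, so its access cost is shared across the window.
        for r in partitions[step % k]:
            for s in index.get(key_r(r), ()):
                yield (s, r)

        # After k consecutive partitions the oldest batch has seen all of R.
        if len(window) == k:
            for s in window.popleft():
                bucket = index[key_s(s)]
                bucket.remove(s)
                if not bucket:
                    del index[key_s(s)]
        step += 1
```

In this sketch the memory held for S is bounded by roughly `num_partitions * batch_size` tuples, which gives a feel for the kind of memory-versus-throughput trade-off the abstract says the cost model is meant to tune; the actual cost model is in the paper and is not reproduced here.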

Bibliographic details

Source
IEEE Transactions on Knowledge and Data Engineering, 2008, no. 7, pp. 976-991 (16 pages)
Authors
Polyzotis N.; Skiadopoulos S.; Vassiliadis P.; Simitsis A.; Frantzell N.
Original format: PDF
Language: English
Chinese Library Classification: Computing technology, computer technology
Keywords
data analysis; data warehouses; MESHJOIN algorithm; active data warehouse; meshing streaming updates; online warehouse refreshment; persistent data; Data warehouse and repository; Query processing

Similar literature

1. Efficient processing of streaming updates with archived master data in near-real-time data warehousing [J]. M. Asif Naeem, Gillian Dobbie, Gerald Weber. Knowledge and Information Systems, 2014, no. 3.
2. Scheduling Effective Cloud Updates in Streaming Data Warehouses using RECSS Algorithm [J]. D. S. Misbha, J. R. Jeba. International Journal of Applied Engineering Research, 2016, no. 5aPta7.
3. Scalable Scheduling of Updates in Streaming Data Warehouses [J]. Golab L. IEEE Transactions on Knowledge and Data Engineering, 2012, no. 6.
4. A Partition-based Approach to Support Streaming Updates over Persistent Data in an Active Data Warehouse [C]. Abhirup Chakraborty, Ajit Singh. International Symposium on Parallel Distributed Processing, 2009.
5. Data warehouse stream view update with multiple streaming [D]. Ahamed, Jamal Uddin. 2005.
6. Automated Creation of Datamarts from a Clinical Data Warehouse Driven by an Active Metadata Repository [O]. Charles L. Rogerson, Paul H. Kohlmiller, Harris Stutman. 1998.
7. MESHJOIN*: An Algorithm Supporting Streaming Updates in a Real-time Data Warehouse [O]. 林子雨, 林琛, 冯少荣. 2010.
