Efficient processing of streaming updates with archived master data in near-real-time data warehousing

M. Asif Naeem; Gillian Dobbie; Gerald Weber

Abstract

To make timely and effective decisions, businesses need the latest information from data warehouse repositories. To keep these repositories up to date with respect to end-user updates, near-real-time data integration is required. An important phase in near-real-time data integration is data transformation, in which the stream of updates is joined with disk-based master data. The stream-based algorithm MESHJOIN (Mesh Join) has been proposed to amortize disk access over fast streams. MESHJOIN makes no assumptions about the data distribution. In real-world applications, however, skewed distributions can be found, such as a stream of products sold in which certain products are sold more frequently than the rest. The question that arises is how much performance MESHJOIN loses by not adapting to data skew. In this paper, we perform a rigorous experimental study analyzing the possible performance improvements while considering typical data distributions. For this purpose, we design an algorithm, Extended Hybrid Join (X-HYBRIDJOIN), that is complementary to MESHJOIN in that it can adapt to data skew and stores parts of the master data in memory permanently, reducing the disk access overhead significantly. We compare the performance of X-HYBRIDJOIN against that of MESHJOIN. We take several precautions to make sure the comparison is adequate and focuses on the utilization of data skew. The experiments show that considering data skew offers substantial room for performance gains that cannot be found in non-adaptive approaches such as MESHJOIN. We also present a cost model for X-HYBRIDJOIN and tune the algorithm based on that cost model.
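
The intuition behind keeping part of the master data in memory under skew can be illustrated with a small simulation. The sketch below is not X-HYBRIDJOIN or MESHJOIN: it uses naive per-tuple lookups rather than MESHJOIN's amortized disk access, and every name in it (`DiskMasterData`, `join_stream`, `make_skewed_stream`), the Zipf-like weights, and the choice of 50 pinned keys are assumptions made purely for illustration. It only shows how pinning frequently requested master rows in memory reduces simulated disk reads when the stream is skewed.

```python
import random
from collections import Counter


class DiskMasterData:
    """Hypothetical stand-in for disk-based master data; counts simulated disk reads."""

    def __init__(self, rows):
        self._rows = rows
        self.reads = 0

    def lookup(self, key):
        self.reads += 1  # each lookup models one disk access
        return self._rows[key]


def join_stream(stream, disk, pinned=None):
    """Join each (key, quantity) stream tuple with its master row.

    Rows found in `pinned` are served from memory; all others go to `disk`.
    """
    pinned = pinned or {}
    joined = []
    for key, quantity in stream:
        row = pinned.get(key)
        if row is None:
            row = disk.lookup(key)
        joined.append((key, quantity, row))
    return joined


def make_skewed_stream(n_keys, n_tuples, seed=0):
    """Generate a stream whose key frequencies follow an assumed Zipf-like law."""
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) for rank in range(n_keys)]
    keys = rng.choices(range(n_keys), weights=weights, k=n_tuples)
    return [(key, 1) for key in keys]


rows = {key: f"product-{key}" for key in range(1000)}
stream = make_skewed_stream(n_keys=1000, n_tuples=10000)

# Baseline: every stream tuple triggers a simulated disk read.
baseline = DiskMasterData(rows)
baseline_result = join_stream(stream, baseline)

# Skew-aware variant: pin the 50 most frequent keys in memory.
# Frequencies are taken from the stream itself here; a real system
# would have to estimate them online.
hot_keys = [key for key, _ in Counter(key for key, _ in stream).most_common(50)]
cached = DiskMasterData(rows)
pinned_rows = {key: rows[key] for key in hot_keys}
cached_result = join_stream(stream, cached, pinned_rows)

print("disk reads without pinning:", baseline.reads)
print("disk reads with pinning:   ", cached.reads)
```

Both runs produce the same join output; only the number of simulated disk reads differs. How much is saved depends entirely on the assumed skew and on how many rows are pinned, which is the kind of trade-off a cost model is meant to capture.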

Bibliographic details

Source: Knowledge and information systems, 2014, Issue 3, 23 pages
Authors: M. Asif Naeem; Gillian Dobbie; Gerald Weber
Original format: PDF
Language: English
Chinese Library Classification: Automation system theory
Keywords: Near-real-time data warehousing; Stream-based join; Data transformation; Performance and tuning

Similar literature

1. Efficient processing of streaming updates with archived master data in near-real-time data warehousing [J]. M. Asif Naeem, Gillian Dobbie, Gerald Weber. Knowledge and information systems. 2014, Issue 3
2. Parallel Star Join+DataIndexes: efficient query processing in data warehouses and OLAP [J]. Datta A., VanderMeer D., Ramamritham K. IEEE Transactions on Knowledge and Data Engineering. 2002, Issue 6
3. Scheduling Effective Cloud Updates in Streaming Data Warehouses using RECSS Algorithm [J]. D. S. Misbha, J. R. Jeba. International Journal of Applied Engineering Research. 2016, Issue 5aPta7
4. An Innovative Lambda-Architecture-Based Data Warehouse Maintenance Framework for Effective and Efficient Near-Real-Time OLAP over Big Data [C]. Alfredo Cuzzocrea, Rim Moussa, Gianni Vercelli. Big data - BigData 2018. 2018
5. Data warehouse stream view update with multiple streaming. [D]. Ahamed, Jamal Uddin. 2005
6. Architecting the Data Loading Process for an i2b2 Research Data Warehouse: Full Reload versus Incremental Updating [O]. Andrew R. Post, Miao Ai, Akshatha Kalsanka Pai. 2017
7. Parallel Star Join + Data Indexes: efficient query processing in data warehouses and OLAP [O]. Datta Anindya, VanderMeer Debra, Ramamritham Krithi. 2002
