Journal: Knowledge and Information Systems

Efficient processing of streaming updates with archived master data in near-real-time data warehousing


Abstract

To make timely and effective decisions, businesses need the latest information from data warehouse repositories. To keep these repositories up to date with respect to end-user updates, near-real-time data integration is required. An important phase in near-real-time data integration is data transformation, where the stream of updates is joined with disk-based master data. The stream-based algorithm MESHJOIN (Mesh Join) has been proposed to amortize disk access over fast streams. MESHJOIN makes no assumptions about the data distribution. In real-world applications, however, skewed distributions can be found, such as a stream of products sold, where certain products are sold more frequently than the rest. The question that arises is how much performance MESHJOIN loses by not adapting to data skew. In this paper we perform a rigorous experimental study analyzing the possible performance improvements while considering typical data distributions. For this purpose we design an algorithm, Extended Hybrid Join (X-HYBRIDJOIN), that is complementary to MESHJOIN in that it can adapt to data skew and stores parts of the master data in memory permanently, reducing the disk access overhead significantly. We compare the performance of X-HYBRIDJOIN against the performance of MESHJOIN. We take several precautions to make sure the comparison is adequate and focuses on the utilization of data skew. The experiments show that considering data skew offers substantial room for performance gains that cannot be found in non-adaptive approaches such as MESHJOIN. We also present a cost model for X-HYBRIDJOIN and tune the algorithm based on that cost model.
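The intuition behind keeping part of the master data in memory can be illustrated with a toy sketch. This is not the MESHJOIN or X-HYBRIDJOIN algorithm from the paper; it is a simplified illustration under stated assumptions: the disk-based master data is simulated by a Python dict, every lookup that misses the in-memory part counts as one disk read, and the names `pick_hot_keys` and `join_stream` are hypothetical.

```python
from collections import Counter


def pick_hot_keys(sample, n):
    """Return the n most frequent join keys in a sample of the stream.

    Hypothetical helper: the paper's own selection strategy may differ.
    """
    counts = Counter(key for key, _ in sample)
    return [key for key, _ in counts.most_common(n)]


def join_stream(stream, master, hot_keys=()):
    """Join each (key, payload) stream tuple with its master record.

    `master` stands in for disk-based master data. Records for `hot_keys`
    are copied into memory once (that one-time load is not counted);
    every other lookup is counted as one simulated disk read.
    """
    in_memory = {key: master[key] for key in hot_keys}
    disk_reads = 0
    joined = []
    for key, payload in stream:
        record = in_memory.get(key)
        if record is None:
            record = master[key]  # simulated disk access
            disk_reads += 1
        joined.append((key, payload, record))
    return joined, disk_reads


# A skewed stream: product 0 is sold far more often than the others.
master = {p: f"product-{p}" for p in range(5)}
stream = [(0, "a"), (0, "b"), (1, "c"), (0, "d"), (4, "e"), (0, "f")]

_, reads_without_pinning = join_stream(stream, master)
hot = pick_hot_keys(stream, 1)
joined, reads_with_pinning = join_stream(stream, master, hot)
print(reads_without_pinning, reads_with_pinning)
```

In this toy setting, pinning the single most frequent key removes every simulated disk read for that key, which mirrors the abstract's point that an approach unaware of skew cannot benefit from it. The real algorithms additionally buffer the stream and amortize each disk read over many stream tuples, which this sketch does not model.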
