
An Approach for Testing the Extract-Transform-Load Process in Data Warehouse Systems


Abstract

Enterprises use data warehouses to accumulate data from multiple sources for data analysis and research. Since organizational decisions are often made based on the data stored in a data warehouse, all its components must be rigorously tested. In this thesis, we first present a comprehensive survey of data warehouse testing approaches, and then develop and evaluate an automated testing approach for validating the Extract-Transform-Load (ETL) process, which is a common activity in data warehousing.

In the survey we present a classification framework that categorizes the testing and evaluation activities applied to the different components of data warehouses. These approaches include both dynamic analysis as well as static evaluation and manual inspections. The classification framework uses information related to what is tested in terms of the data warehouse component that is validated, and how it is tested in terms of various types of testing and evaluation approaches. We discuss the specific challenges and open problems for each component and propose research directions.

The ETL process involves extracting data from source databases, transforming it into a form suitable for research and analysis, and loading it into a data warehouse. ETL processes can use complex one-to-one, many-to-one, and many-to-many transformations involving sources and targets that use different schemas, databases, and technologies. Since faulty implementations in any of the ETL steps can result in incorrect information in the target data warehouse, ETL processes must be thoroughly validated. In this thesis, we propose automated balancing tests that check for discrepancies between the data in the source databases and that in the target warehouse. Balancing tests ensure that the data obtained from the source databases is not lost or incorrectly modified by the ETL process. First, we categorize and define a set of properties to be checked in balancing tests. We identify various types of discrepancies that may exist between the source and the target data, and formalize three categories of properties, namely, completeness, consistency, and syntactic validity that must be checked during testing. Next, we automatically identify source-to-target mappings from ETL transformation rules provided in the specifications. We identify one-to-one, many-to-one, and many-to-many mappings for tables, records, and attributes involved in the ETL transformations. We automatically generate test assertions to verify the properties for balancing tests. We use the source-to-target mappings to automatically generate assertions corresponding to each property. The assertions compare the data in the target data warehouse with the corresponding data in the sources to verify the properties.

We evaluate our approach on a health data warehouse that uses data sources with different data models running on different platforms. We demonstrate that our approach can find previously undetected real faults in the ETL implementation. We also provide an automatic mutation testing approach to evaluate the fault finding ability of our balancing tests. Using mutation analysis, we demonstrated that our auto-generated assertions can detect faults in the data inside the target data warehouse when faulty ETL scripts execute on mock source data.
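
To make the idea of a balancing test concrete, the following is a minimal sketch and not the thesis's implementation. It assumes a single one-to-one table mapping, uses SQLite for both source and target, and reads "completeness" as "no source record is missing from the target" and "consistency" as "mapped attribute values agree". Those readings, the table and column names, and the helper functions `generate_assertions`, `run_balancing_tests`, and `build_demo` are all hypothetical. The `faulty` branch stands in for a mutated ETL script run on mock source data.

```python
import sqlite3

# Hypothetical one-to-one source-to-target mapping (all names invented for illustration).
MAPPING = {
    "source_table": "src_patient",
    "target_table": "dw_patient",
    "key": ("patient_id", "patient_key"),            # (source column, target column)
    "attributes": [("birth_year", "year_of_birth")],  # mapped attribute pairs
}


def generate_assertions(m):
    """Build (name, SQL) pairs from a mapping.

    Each query counts violations, so a result of 0 means the check passes.
    """
    src, tgt = m["source_table"], m["target_table"]
    skey, tkey = m["key"]
    checks = [
        # Completeness (assumed reading): every source record appears in the target.
        ("completeness",
         f"SELECT COUNT(*) FROM {src} s LEFT JOIN {tgt} t "
         f"ON s.{skey} = t.{tkey} WHERE t.{tkey} IS NULL"),
    ]
    for scol, tcol in m["attributes"]:
        # Consistency (assumed reading): mapped values agree; IS NOT is NULL-safe in SQLite.
        checks.append((
            f"consistency:{scol}",
            f"SELECT COUNT(*) FROM {src} s JOIN {tgt} t "
            f"ON s.{skey} = t.{tkey} WHERE s.{scol} IS NOT t.{tcol}",
        ))
    return checks


def run_balancing_tests(conn, m):
    """Run every generated assertion and return {check name: violation count}."""
    return {name: conn.execute(sql).fetchone()[0]
            for name, sql in generate_assertions(m)}


def build_demo(faulty):
    """Create mock source data and load the target with a correct or a faulty 'ETL'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE src_patient (patient_id INTEGER PRIMARY KEY, birth_year INTEGER)")
    conn.execute("CREATE TABLE dw_patient (patient_key INTEGER PRIMARY KEY, year_of_birth INTEGER)")
    conn.executemany("INSERT INTO src_patient VALUES (?, ?)",
                     [(1, 1980), (2, 1975), (3, 1990)])
    if faulty:
        # Stand-in for a mutated ETL script: drops patient 3 and corrupts patient 2.
        conn.execute("INSERT INTO dw_patient "
                     "SELECT patient_id, birth_year FROM src_patient WHERE patient_id < 3")
        conn.execute("UPDATE dw_patient SET year_of_birth = 1957 WHERE patient_key = 2")
    else:
        conn.execute("INSERT INTO dw_patient SELECT patient_id, birth_year FROM src_patient")
    return conn


if __name__ == "__main__":
    for faulty in (False, True):
        print("faulty ETL" if faulty else "correct ETL",
              run_balancing_tests(build_demo(faulty), MAPPING))
```

Writing each assertion as a violation count keeps the pass criterion uniform (zero violations) regardless of the property being checked. Many-to-one and many-to-many mappings, and sources on different platforms, would need more than a single SQL join and are outside the scope of this sketch.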

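Bibliographic record

  • Author: Homayouni, Hajar
  • Author affiliation: Colorado State University
  • Degree-granting institution: Colorado State University
  • Subject: Computer science
  • Degree: M.S.
  • Year: 2018
  • Pages: 97 p.
  • Total pages: 97
  • Original format: PDF
  • Language: English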

  • Date indexed: 2022-08-17 11:53:14
