Venue: International Conference on Very Large Data Bases

Automating Large-Scale Data Quality Verification



Abstract

Modern companies and institutions rely on data to guide every single business process and decision. Missing or incorrect information seriously compromises any decision process downstream. Therefore, a crucial but tedious task for everyone involved in data processing is to verify the quality of their data. We present a system for automating the verification of data quality at scale, which meets the requirements of production use cases. Our system provides a declarative API, which combines common quality constraints with user-defined validation code, and thereby enables 'unit tests' for data. We efficiently execute the resulting constraint validation workload by translating it to aggregation queries on Apache Spark. Our platform supports the incremental validation of data quality on growing datasets, and leverages machine learning, e.g., for enhancing constraint suggestions, for estimating the 'predictability' of a column, and for detecting anomalies in historic data quality time series. We discuss our design decisions, describe the resulting system architecture, and present an experimental evaluation on various datasets.
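
The idea of declarative 'unit tests' for data that are evaluated through aggregations can be illustrated with a small sketch. This is not the system's actual API and does not use Apache Spark. It is plain Python, and every name in it (`Constraint`, `is_complete`, `satisfies`, `scan`, `merge`, `verify`) is hypothetical and chosen only for illustration. Each constraint is declared as a per-row predicate plus an assertion on the fraction of rows that satisfy it. All constraints are then evaluated in a single pass that only maintains counters. Because the counters are additive, the state of a new batch can be merged with an earlier state, which hints at how incremental validation on a growing dataset might work.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, Optional[float]]
State = Dict[str, int]  # additive counters: total rows plus one hit count per constraint


@dataclass(frozen=True)
class Constraint:
    """Hypothetical declarative constraint (not the paper's API)."""
    name: str
    hit: Callable[[Row], bool]           # per-row predicate, aggregated as a count
    assertion: Callable[[float], bool]   # applied to the fraction of rows that hit


def is_complete(column: str) -> Constraint:
    """Common constraint: the column must never be missing."""
    return Constraint(
        name=f"is_complete({column})",
        hit=lambda row: row.get(column) is not None,
        assertion=lambda fraction: fraction == 1.0,
    )


def satisfies(name: str, predicate: Callable[[Row], bool],
              assertion: Callable[[float], bool]) -> Constraint:
    """User-defined validation code wrapped as a constraint."""
    return Constraint(name=name, hit=predicate, assertion=assertion)


def scan(rows: Iterable[Row], constraints: List[Constraint]) -> State:
    """Evaluate all constraints in one pass, keeping only counters."""
    state: State = {"rows": 0}
    for c in constraints:
        state[c.name] = 0
    for row in rows:
        state["rows"] += 1
        for c in constraints:
            state[c.name] += int(c.hit(row))
    return state


def merge(a: State, b: State) -> State:
    """Combine the states of two batches; valid because counters are additive."""
    return {key: a[key] + b[key] for key in a}


def verify(state: State, constraints: List[Constraint]) -> Dict[str, bool]:
    """Apply each assertion to its aggregated fraction.

    An empty dataset yields a fraction of 0.0 (a choice made for this sketch).
    """
    n = state["rows"]
    return {
        c.name: c.assertion(state[c.name] / n if n else 0.0)
        for c in constraints
    }


# Illustrative data, invented for this example.
checks = [
    is_complete("id"),
    satisfies(
        "rating in [1, 5]",
        lambda row: row.get("rating") is not None and 1 <= row["rating"] <= 5,
        lambda fraction: fraction >= 0.9,
    ),
]
day1 = [{"id": 1, "rating": 5}, {"id": 2, "rating": 4}]
day2 = [{"id": None, "rating": 9}]

state1 = scan(day1, checks)
state2 = merge(state1, scan(day2, checks))  # only the new batch is scanned

print(verify(state1, checks))
print(verify(state2, checks))
```

In this sketch the second verification reuses the counters from the first day rather than rescanning it. On a distributed engine, the per-constraint counters would presumably correspond to aggregate expressions evaluated in one query, but that mapping is an assumption here and is not spelled out in the abstract.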
