Conference: International Conference on Big Data Analytics and Practices

Sakdas: A Python Package for Data Profiling and Data Quality Auditing

Abstract

Data profiling and data quality management have become an increasingly significant part of data engineering and are essential to ensuring that a system delivers quality information to its users. Over the last decade, data quality has been recognized as needing more management, especially in the big data era, in which data comes from many sources, in many types, and in enormous volume. This makes managing data quality more difficult and complicated, and traditional systems have been unable to respond as needed. Data quality management software for big data has been developed, but it is often expensive, difficult to customize, and mostly provided as a GUI, which is challenging to integrate with other systems. To address this problem, we have developed an open-source package for data quality management in Python, a programming language that is widely used in science and engineering today because its syntax is easy to read, it is small, and it has many additional packages to integrate with. The package, called "Sakdas", is divided into three parts. The first part deals with data profiling: it provides a set of data analyses that generate a data profile, which in turn helps in defining data quality rules. The second part deals with data quality auditing, in which users can set their own data quality rules for data quality measurement. The final part deals with data visualization, providing data profiling and data auditing reports to help improve data quality. The user can obtain the results of the profiling and auditing services either as a report for self-review or as JSON for use in post-process automation.
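
The abstract does not show the package's interface, so the following is only a minimal sketch, in plain Python, of the workflow it describes: profile the data, define rules from the profile, audit against those rules, and emit JSON. It is not the Sakdas API. The function names `profile_column`, `profile`, and `audit`, the rule format, and the sample rows are all hypothetical and chosen for illustration.

```python
import json
from statistics import mean


def profile_column(values):
    """Summarise one column: row count, missing values, distinct values, numeric range."""
    present = [v for v in values if v is not None]
    numeric = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    # Only report numeric statistics when every present value is numeric.
    if numeric and len(numeric) == len(present):
        summary.update(min=min(numeric), max=max(numeric), mean=mean(numeric))
    return summary


def profile(rows):
    """Build a per-column profile for a list of dict records (hypothetical helper)."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


def audit(rows, rules):
    """Check user-defined rules; `rules` maps a rule name to (column, predicate)."""
    results = {}
    for name, (column, predicate) in rules.items():
        failed = [i for i, row in enumerate(rows) if not predicate(row.get(column))]
        results[name] = {
            "column": column,
            "passed": len(rows) - len(failed),
            "failed_rows": failed,
        }
    return results


# Made-up sample records, for illustration only.
rows = [
    {"id": 1, "age": 34, "country": "TH"},
    {"id": 2, "age": None, "country": "TH"},
    {"id": 3, "age": 212, "country": "US"},
]

data_profile = profile(rows)

# Rules a user might write after seeing a missing value and a large maximum in the profile.
rules = {
    "age_not_null": ("age", lambda v: v is not None),
    "age_in_range": ("age", lambda v: v is None or 0 <= v <= 120),
}
audit_result = audit(rows, rules)

# JSON output suitable for post-process automation.
report_json = json.dumps({"profile": data_profile, "audit": audit_result}, indent=2)
print(report_json)
```

In this sketch the profile exposes the missing `age` value and the implausible maximum, which is what motivates the two rules; the audit then reports which row indices violate each rule, and the combined result is serialised so that another program can consume it.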
