International Journal of Population Data Science

Cecilia: an R package to automate data cleaning of administrative datasets



Abstract

Introduction

Data linkage has considerable potential to improve health and society. Linking vast and detailed information across multiple population-scale administrative data sources enhances the quality of existing data, empowers population health research, and produces objective evidence to inform policy decisions. In this context, data cleaning is crucial to minimise linkage errors.

Objectives and Approach

Dealing with the heterogeneity of administrative datasets is an acknowledged time-consuming task. The objective is therefore to create a public, open-source R package that automates and reports the steps of data cleaning in a reproducible fashion. The package automatically assesses variables and reports the information and issues relevant for linkage, then cleans the dataset based on the problems found, reassesses the variables, and reports the results again. It has a default cleaning procedure based on years of accumulated linkage knowledge, and an interactive exploratory session for checking variables individually. The report also includes all settings used in both the default procedure and the interactive session.

Results

The package accurately detected, cleaned, and reported potential linkage problems in variables such as names, addresses, and dates in the 15 real population datasets processed so far, which came from multiple sources and covered a diverse range of formats, content, and inconsistencies. The entire process took minutes rather than hours. The reports correctly gathered, organised, and presented all information relevant for linkage in the distinct sections of the hyperlinked document, such as those covering the dataset, the individual variables, and the settings used for cleaning. The types of information included text, figures, data dictionaries, and frequency tables of detected issues, such as non-alphanumeric characters, annotation terms, and suffixes and prefixes. The output datasets contained all evaluated variables with cleaned data, plus extra columns holding only the issues themselves or the problematic records.

Conclusion/Implications

The package accelerates the data cleaning of linkage variables by automating time-consuming steps, and it provides both the information pertinent to linkage and the cleaned datasets. The complete process is time-efficient and reproducible. Because the output dataset contains the cleaned variables alongside the detected issues, it allows the level of cleaning performed to be assessed.
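
The assess, clean, and reassess cycle described in the abstract can be illustrated with a minimal sketch. This is not the package's code or API: the abstract does not describe its functions, so the sketch is written in Python for illustration only, and every name in it (`assess`, `clean_name`, `run_pipeline`, and the prefix and suffix lists) is a hypothetical stand-in. It handles a single name variable, counts three of the issue types the abstract mentions, and returns the cleaned values together with a separate list of the issues removed from each record.

```python
import re
from collections import Counter

# Hypothetical rule set: the package's real defaults are not given in the abstract.
TITLE_PREFIXES = {"mr", "mrs", "ms", "dr"}
NAME_SUFFIXES = {"jr", "sr", "ii", "iii"}
NON_ALNUM = re.compile(r"[^A-Za-z0-9\s]")


def assess(values):
    """Count how many records show each issue type (a small frequency table)."""
    counts = Counter()
    for value in values:
        tokens = NON_ALNUM.sub(" ", value).lower().split()
        if NON_ALNUM.search(value):
            counts["non_alphanumeric"] += 1
        if tokens and tokens[0] in TITLE_PREFIXES:
            counts["prefix"] += 1
        if tokens and tokens[-1] in NAME_SUFFIXES:
            counts["suffix"] += 1
    return counts


def clean_name(value):
    """Return (cleaned_value, issues_removed) for one name."""
    issues = NON_ALNUM.findall(value)            # characters that will be dropped
    tokens = NON_ALNUM.sub(" ", value).split()   # also collapses repeated spaces
    if tokens and tokens[0].lower() in TITLE_PREFIXES:
        issues.append(tokens.pop(0))
    if tokens and tokens[-1].lower() in NAME_SUFFIXES:
        issues.append(tokens.pop())
    return " ".join(tokens).upper(), issues


def run_pipeline(values):
    """Assess, clean, then reassess, keeping the detected issues per record."""
    before = assess(values)
    results = [clean_name(v) for v in values]
    cleaned = [c for c, _ in results]
    issues = [i for _, i in results]
    after = assess(cleaned)
    return {"before": before, "cleaned": cleaned, "issues": issues, "after": after}


if __name__ == "__main__":
    report = run_pipeline(["Dr. Maria Silva", "John Smith Jr.", "ana  souza"])
    for key, value in report.items():
        print(key, value)
```

Keeping the removed issues next to the cleaned values mirrors the extra issue columns the abstract describes, and comparing the `before` and `after` counts gives a simple check on how much cleaning was done.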
