Cecilia: an R package to automate data cleaning of administrative datasets

Alexandre Franco Garcia; Miro Palfy; Stacy Ann Vasquez

Abstract

Introduction
Data linkage has considerable potential to improve health and society. Linking vast and detailed information across multiple population-scale administrative data sources enhances the quality of existing data, empowers population health research, and produces objective evidence to inform policy decisions. In this context, data cleaning is crucial to minimise linkage errors.

Objectives and Approach
Dealing with the heterogeneity of administrative datasets is an acknowledged time-consuming task, so the objective is to create a public, open-source R package that automates and reports the steps of data cleaning in a reproducible fashion. The package automatically assesses variables and reports information and issues relevant to linkage, then cleans the dataset based on the problems found, reassesses the variables, and reports the results again. It has a default cleaning procedure based on years of accumulated linkage knowledge, and an interactive exploratory session for checking variables individually. The report also includes all settings used in both the default and interactive sessions.

Results
The package accurately detected, cleaned and reported potential linkage problems in variables such as names, addresses and dates in the 15 real population datasets tested so far, which came from multiple sources and spanned a diverse range of formats, content, and inconsistencies. The entire process took minutes rather than hours. The reports correctly gathered, organised and presented all information relevant to linkage in every section of the hyperlinked document, including the sections on the dataset, the individual variables, and the settings used for cleaning. The types of information included text, figures, data dictionaries, and frequency tables of detected issues, such as non-alphanumeric characters, annotation terms, or suffixes and prefixes. In the output datasets, every evaluated variable held cleaned data, and extra columns held only the detected issues or the problematic records.

Conclusion/Implications
The package accelerates the data cleaning of linkage variables by automating time-consuming steps, and it provides both pertinent information for linkage and cleaned datasets. The complete process is time-efficient and reproducible. Because the output dataset contains both the cleaned variables and the detected issues, it allows the level of cleaning performed to be assessed.
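
The assess, clean, reassess cycle described in the abstract can be illustrated with a minimal sketch. It is written in Python for illustration only and is not Cecilia's code or API: the package itself is in R, and the abstract does not give its function names or cleaning rules. The function names (`assess`, `clean`, `run`), the prefix and suffix lists, and the upper-casing step are all assumptions made for this example. The sketch counts issues in a name variable, cleans each value while keeping the removed pieces in a separate column, and then counts issues again.

```python
import re
from collections import Counter

# Hypothetical issue definitions; the abstract does not list Cecilia's actual rules.
PREFIXES = {"mr", "mrs", "ms", "dr"}      # assumed name prefixes
SUFFIXES = {"jr", "sr", "ii", "iii"}      # assumed name suffixes
NON_ALNUM = re.compile(r"[^A-Za-z0-9 ]")  # anything other than letters, digits, spaces


def assess(values):
    """Count how many values show each issue type (a small frequency table)."""
    counts = Counter()
    for value in values:
        tokens = NON_ALNUM.sub(" ", value).lower().split()
        if NON_ALNUM.search(value):
            counts["non_alphanumeric"] += 1
        if tokens and tokens[0] in PREFIXES:
            counts["prefix"] += 1
        if tokens and tokens[-1] in SUFFIXES:
            counts["suffix"] += 1
    return dict(counts)


def clean(value):
    """Return (cleaned_value, removed_issues) for a single name."""
    removed = NON_ALNUM.findall(value)
    tokens = NON_ALNUM.sub(" ", value).split()
    if tokens and tokens[0].lower() in PREFIXES:
        removed.append(tokens.pop(0))
    if tokens and tokens[-1].lower() in SUFFIXES:
        removed.append(tokens.pop())
    # Upper-casing is an assumed normalisation step, not taken from the abstract.
    return " ".join(tokens).upper(), removed


def run(values):
    """Assess, clean, then reassess, keeping the issues next to the cleaned data."""
    before = assess(values)
    rows = [clean(v) for v in values]
    cleaned = [c for c, _ in rows]
    issues = [r for _, r in rows]
    after = assess(cleaned)
    return {"before": before, "after": after, "cleaned": cleaned, "issues": issues}


if __name__ == "__main__":
    report = run(["Dr. John  Smith", "Smith, Mary", "James Brown Jr"])
    for key, value in report.items():
        print(key, value)
```

Keeping the removed pieces in their own column mirrors the abstract's point that the output retains the detected issues alongside the cleaned data, so the extent of cleaning can be checked afterwards.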


Bibliographic record

Source: International Journal of Population Data Science, 2018, Issue 4
Authors: Alexandre Franco Garcia; Miro Palfy; Stacy Ann Vasquez
Original format: PDF
Chinese Library Classification: Economics
