DeepClean: Data Cleaning via Question Asking

Abstract

As a critical task in the data analysis pipeline, data cleaning is notoriously labor-intensive and error-prone. Knowledge base-assisted data cleaning has proved a powerful tool for finding and fixing data defects; however, its applicability is inevitably bounded by the natural limitations of knowledge bases. Meanwhile, although a vast number of knowledge sources exist in the form of free-text corpora (e.g., Wikipedia), transforming them into formats usable by existing data cleaning tools can be prohibitively costly and error-prone, if not altogether impossible. Here, we present DeepClean, the first end-to-end data cleaning framework powered by free-text knowledge sources. At a high level, DeepClean leverages a knowledge source through its question-answering (QA) interface and achieves high-quality cleaning via iterative question asking. Specifically, DeepClean detects and repairs data defects in three stages: (i) Pattern extraction - it automatically discovers the semantic types of the data attributes as well as their correlations; (ii) Question generation - it translates each data tuple into a minimal set of validation questions; (iii) Completion and repair - by checking the answers returned by the knowledge source against the data values, it identifies erroneous cases and suggests possible fixes. Through extensive empirical studies, we demonstrate that DeepClean is applicable to a range of domains and can effectively repair a variety of data defects, highlighting data cleaning powered by free-text knowledge sources as a promising direction for future research.
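The three stages named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no code or API, so every name below (`PATTERNS`, `generate_questions`, `clean_row`, `toy_ask`) is hypothetical, stage (i) is replaced by a single hard-coded pattern, and a dictionary lookup stands in for a real free-text QA system.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, str]

# Stage (i), hard-coded for illustration: an assumed pattern saying that
# the "capital" attribute is determined by the "country" attribute.
PATTERNS: List[Tuple[str, str, str]] = [
    ("country", "capital", "What is the capital of {}?"),
]


def generate_questions(row: Row) -> List[Tuple[str, str]]:
    """Stage (ii): turn one tuple into (target attribute, question) pairs."""
    return [
        (target, template.format(row[source]))
        for source, target, template in PATTERNS
        if row.get(source)
    ]


def clean_row(
    row: Row, ask: Callable[[str], Optional[str]]
) -> Tuple[Row, List[str]]:
    """Stage (iii): compare answers with stored values; fill or repair."""
    fixed = dict(row)
    changed: List[str] = []
    for target, question in generate_questions(row):
        answer = ask(question)
        if answer is None:
            continue  # no answer from the knowledge source; keep the value
        current = (fixed.get(target) or "").strip().lower()
        if current != answer.strip().lower():
            fixed[target] = answer  # suggested fix or completion
            changed.append(target)
    return fixed, changed


# Toy stand-in for a QA interface over a free-text corpus.
TOY_ANSWERS = {
    "What is the capital of France?": "Paris",
    "What is the capital of Japan?": "Tokyo",
}


def toy_ask(question: str) -> Optional[str]:
    return TOY_ANSWERS.get(question)


rows = [
    {"country": "France", "capital": "Lyon"},          # erroneous value
    {"country": "Japan", "capital": ""},               # missing value
    {"country": "Atlantis", "capital": "Poseidonis"},  # unanswerable
]
cleaned = [clean_row(r, toy_ask) for r in rows]
```

In this sketch a row is left untouched whenever the knowledge source returns no answer, which is one conservative design choice among several; the abstract does not say how DeepClean handles unanswered questions.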

Bibliographic details

Source: IEEE International Conference on Data Science and Advanced Analytics | 2019 | 696p | 10 pages
Authors: Xinyang Zhang; Yujie Ji; Chanh Nguyen; Ting Wang
Original format: PDF
Chinese Library Classification: overall structure, system architecture
Keywords: Cleaning; Encyclopedias; Electronic publishing; Internet; Maintenance engineering; Physics

Similar literature

1. 'To Clean or Not to Clean?' Reducing Daily Routine Hotel Room Cleaning by Letting Tourists Answer This Question for Themselves [J]. Cvelbar Ljubica Knezevic, Gruen Bettina, Dolnicar Sara. Journal of Travel Research. 2021, No. 1
2. Parts Cleaning: The Answers to Your Cleaning Questions May Surprise You [J]. Doug Kaufman. Engine Builder. 2015, No. Apra
3. A challenging question: how clean is clean? [J]. International Food Hygiene. 2014, No. 3
4. DeepClean: Data Cleaning via Question Asking [C]. Xinyang Zhang, Yujie Ji, Chanh Nguyen, Ting Wang. IEEE International Conference on Data Science and Advanced Analytics. 2019
5. Scaling the Technology Opportunity Analysis text data mining methodology: Data extraction, cleaning, online analytical processing analysis, and reporting of large multi-source datasets [D]. George, Richard Peyton. 2006
6. Association between Clean Delivery Kit Use, Clean Delivery Practices, and Neonatal Survival: Pooled Analysis of Data from Three Sites in South Asia [O]. Nadine Seward, David Osrin, Leah Li, 2012
7. DeepClean: Self-Supervised Artefact Rejection for Intensive Care Waveform Data Using Deep Generative Learning [O]. Tom Edinburgh, Peter Smielewski, Marek Czosnyka, 2021
