DeepClean: Data Cleaning via Question Asking

Abstract

As a critical task in the data analysis pipeline, data cleaning is notoriously labor-intensive and error-prone. Knowledge base-assisted data cleaning has proved a powerful tool for finding and fixing data defects; however, its applicability is inevitably bounded by the natural limitations of knowledge bases. Meanwhile, although a vast number of knowledge sources exist in the form of free-text corpora (e.g., Wikipedia), transforming them into formats usable by existing data cleaning tools can be prohibitively costly and error-prone, if not outright impossible. Here, we present DeepClean, the first end-to-end data cleaning framework powered by free-text knowledge sources. At a high level, DeepClean leverages a knowledge source through its question-answering (QA) interface and achieves high-quality cleaning via iterative question asking. Specifically, DeepClean detects and repairs data defects in three stages: (i) pattern extraction: it automatically discovers the semantic types of the data attributes as well as their correlations; (ii) question generation: it translates each data tuple into a minimal set of validation questions; (iii) completion and repair: by checking the answers returned by the knowledge source against the data values, it identifies erroneous cases and suggests possible fixes. Through extensive empirical studies, we demonstrate that DeepClean is applicable to a range of domains and can effectively repair a variety of data defects, highlighting data cleaning powered by free-text knowledge sources as a promising direction for future research.
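The three stages named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no algorithmic detail, so everything here is an assumption made for illustration. The QA interface is a hypothetical dictionary lookup (`toy_qa`), the patterns are hard-coded rather than discovered automatically, and the names `PATTERNS`, `generate_questions`, and `clean_row` are invented for this example.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]

# Hypothetical stand-in for the QA interface of a free-text knowledge source.
TOY_ANSWERS = {
    "What is the capital of France?": "Paris",
    "What is the capital of Japan?": "Tokyo",
}


def toy_qa(question: str) -> Optional[str]:
    """Return an answer, or None when the toy source has nothing to say."""
    return TOY_ANSWERS.get(question)


# Stage (i), pattern extraction: assumed output, hard-coded here.
# Each pattern is (subject attribute, target attribute, question template).
PATTERNS = [("country", "capital", "What is the capital of {country}?")]


def generate_questions(row: Row, patterns) -> List[Tuple[str, str]]:
    """Stage (ii): one validation question per pattern whose subject is set."""
    questions = []
    for subject, target, template in patterns:
        if row.get(subject):
            questions.append((target, template.format(**{subject: row[subject]})))
    return questions


def clean_row(row: Row, patterns, ask: Callable[[str], Optional[str]]):
    """Stage (iii): compare answers with the data, fill gaps, suggest fixes."""
    repaired = dict(row)
    issues = []
    for target, question in generate_questions(row, patterns):
        answer = ask(question)
        if answer is None:
            continue  # no evidence either way, leave the value untouched
        current = row.get(target)
        if not current:
            repaired[target] = answer
            issues.append(f"{target}: filled missing value with {answer!r}")
        elif current.strip().lower() != answer.strip().lower():
            repaired[target] = answer
            issues.append(f"{target}: {current!r} -> suggested {answer!r}")
    return repaired, issues


rows = [
    {"country": "France", "capital": "Lyon"},   # erroneous value
    {"country": "Japan", "capital": None},      # missing value
    {"country": "Japan", "capital": "Tokyo"},   # already consistent
]
results = [clean_row(r, PATTERNS, toy_qa) for r in rows]
for repaired, issues in results:
    print(repaired, issues)
```

Passing the QA function in as the `ask` argument keeps the sketch independent of any particular knowledge source; a real system would replace `toy_qa` with a call to an actual question-answering service.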

Bibliographic details

Source
IEEE International Conference on Data Science and Advanced Analytics, 2018, pp. 283-292 (10 pages)
Authors
Xinyang Zhang; Yujie Ji; Chanh Nguyen; Ting Wang
Original format: PDF
Keywords
Cleaning; Encyclopedias; Electronic publishing; Internet; Maintenance engineering; Physics
