DeepClean: Data Cleaning via Question Asking

Abstract

As a critical task in the data analysis pipeline, data cleaning is notoriously labor-intensive and error-prone. Knowledge base-assisted data cleaning has proved a powerful tool for finding and fixing data defects; however, its applicability is inevitably bounded by the natural limitations of knowledge bases. Meanwhile, although a vast number of knowledge sources exist in the form of free-text corpora (e.g., Wikipedia), transforming them into formats usable by existing data cleaning tools can be prohibitively costly and error-prone, if not altogether impossible. Here, we present DeepClean, the first end-to-end data cleaning framework powered by free-text knowledge sources. At a high level, DeepClean leverages a knowledge source through its question-answering (QA) interface and achieves high-quality cleaning via iterative question asking. Specifically, DeepClean detects and repairs data defects in three stages: (i) Pattern extraction: it automatically discovers the semantic types of the data attributes as well as their correlations; (ii) Question generation: it translates each data tuple into a minimal set of validation questions; (iii) Completion and repair: by checking the answers returned by the knowledge source against the data values, it identifies erroneous cases and suggests possible fixes. Through extensive empirical studies, we demonstrate that DeepClean is applicable to a range of domains and can effectively repair a variety of data defects, highlighting data cleaning powered by free-text knowledge sources as a promising direction for future research.
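To make the three stages concrete, here is a minimal Python sketch of how stages (ii) and (iii) might fit together. It is an illustration only, not the authors' implementation: the names `PATTERNS`, `generate_questions`, `clean_row`, and `toy_ask` are hypothetical, the output of stage (i) is hard-coded rather than discovered, the QA interface is replaced by a dictionary lookup, and no attempt is made to minimize the question set or to ask questions iteratively.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]

# Assumed output of stage (i), pattern extraction: one question template per
# dependent attribute, keyed on the "country" attribute. Hard-coded here.
PATTERNS: Dict[str, str] = {
    "capital": "What is the capital of {country}?",
    "continent": "Which continent is {country} in?",
}


def generate_questions(row: Row) -> List[Tuple[str, str]]:
    """Stage (ii): turn one tuple into (attribute, validation question) pairs."""
    return [(attr, template.format(**row)) for attr, template in PATTERNS.items()]


def clean_row(row: Row, ask: Callable[[str], Optional[str]]) -> Tuple[Row, List[str]]:
    """Stage (iii): compare answers with stored values, fill gaps, suggest fixes."""
    repaired = dict(row)
    issues: List[str] = []
    for attr, question in generate_questions(row):
        answer = ask(question)
        if answer is None:
            continue  # the knowledge source has no answer; leave the value as-is
        current = row.get(attr)
        if current is None:
            repaired[attr] = answer
            issues.append(f"{attr}: filled missing value with {answer!r}")
        elif current.strip().lower() != answer.strip().lower():
            repaired[attr] = answer
            issues.append(f"{attr}: {current!r} replaced by suggestion {answer!r}")
    return repaired, issues


# Toy stand-in for a QA system over a free-text corpus (an assumption).
TOY_ANSWERS = {
    "What is the capital of France?": "Paris",
    "Which continent is France in?": "Europe",
}


def toy_ask(question: str) -> Optional[str]:
    return TOY_ANSWERS.get(question)


row = {"country": "France", "capital": "Lyon", "continent": None}
repaired, issues = clean_row(row, toy_ask)
```

In this toy run the wrong capital is flagged and replaced, and the missing continent is filled in. A real system would also need to handle ambiguous or low-confidence answers, which the sketch ignores.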