DeepClean: Data Cleaning via Question Asking

Abstract

As a critical task in the data analysis pipeline, data cleaning is notoriously labor-intensive and error-prone. Knowledge base-assisted data cleaning has proved a powerful tool for finding and fixing data defects; however, its applicability is inevitably bounded by the natural limitations of knowledge bases. Meanwhile, although a vast number of knowledge sources exist in the form of free-text corpora (e.g., Wikipedia), transforming them into formats usable by existing data cleaning tools can be prohibitively costly and error-prone, if not outright impossible. Here, we present DeepClean, the first end-to-end data cleaning framework powered by free-text knowledge sources. At a high level, DeepClean leverages a knowledge source through its question-answering (QA) interface and achieves high-quality cleaning via iterative question asking. Specifically, DeepClean detects and repairs data defects in three stages: (i) Pattern extraction: it automatically discovers the semantic types of the data attributes as well as their correlations; (ii) Question generation: it translates each data tuple into a minimal set of validation questions; (iii) Completion and repair: by checking the answers returned by the knowledge source against the data values, it identifies erroneous cases and suggests possible fixes. Through extensive empirical studies, we demonstrate that DeepClean is applicable to a range of domains and can effectively repair a variety of data defects, highlighting data cleaning powered by free-text knowledge sources as a promising direction for future research.
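The question-generation and repair stages described in the abstract can be illustrated with a small sketch. This is not DeepClean's implementation: the abstract does not give its algorithms, so every name here (`Pattern`, `generate_questions`, `check_and_repair`, `toy_ask`) is hypothetical. The sketch assumes the attribute correlations from stage (i) are already known and written by hand as question templates, and it replaces the free-text QA interface with a toy lookup table.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, str]
# Hypothetical stand-in for a stage (i) result:
# (key attribute, dependent attribute, question template).
Pattern = Tuple[str, str, str]
AskFn = Callable[[str], Optional[str]]


def generate_questions(row: Row, patterns: List[Pattern]) -> List[Tuple[str, str]]:
    """Stage (ii) sketch: one validation question per pattern whose key is present."""
    return [
        (target, template.format(row[key]))
        for key, target, template in patterns
        if key in row
    ]


def check_and_repair(
    row: Row, patterns: List[Pattern], ask: AskFn
) -> Dict[str, Tuple[Optional[str], str]]:
    """Stage (iii) sketch: compare answers with the stored values.

    Returns {attribute: (current value or None, suggested value)} for every
    attribute that is missing or disagrees with the answer.
    """
    suggestions: Dict[str, Tuple[Optional[str], str]] = {}
    for target, question in generate_questions(row, patterns):
        answer = ask(question)
        if answer is None:
            continue  # the knowledge source had no answer; leave the value alone
        current = row.get(target)
        if current is None or current.strip().lower() != answer.strip().lower():
            suggestions[target] = (current, answer)
    return suggestions


# Toy replacement for a QA interface over free text (an assumption, not a real API).
TOY_ANSWERS = {
    "What is the capital of France?": "Paris",
    "What is the capital of Japan?": "Tokyo",
    "What is the capital of Italy?": "Rome",
}


def toy_ask(question: str) -> Optional[str]:
    return TOY_ANSWERS.get(question)


patterns: List[Pattern] = [("country", "capital", "What is the capital of {}?")]
rows: List[Row] = [
    {"country": "France", "capital": "Lyon"},  # erroneous value
    {"country": "Japan", "capital": "Tokyo"},  # consistent value
    {"country": "Italy"},                      # missing value
]
reports = [check_and_repair(row, patterns, toy_ask) for row in rows]
for row, report in zip(rows, reports):
    print(row, "->", report)
```

In this toy run the first row yields a suggested repair, the second yields none, and the third yields a suggested completion. A real system would need answer normalization and confidence handling that this sketch leaves out.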
