International Joint Conference on Computer Science and Software Engineering

DATA++: An Automated Tool for Intelligent Data Augmentation Using Wikidata


Abstract

Technology now shapes the lives of many people, and artificial intelligence is among its most influential elements. Creative feature engineering is an important part of machine learning methodology: it manipulates existing data, modifying its dimensions to make it work more efficiently. Pulling useful information from external sources and combining it with existing data, however, is cumbersome, since data engineers must find and process external data sources manually. The ability to modify and enrich existing data automatically using external open data sources could therefore prove crucial to data engineers and scientists. In this paper, we propose a method that automatically augments a given structured dataset by inferring relevant dimensions from an external data source with respect to the target attribute. Specifically, our proposed algorithm first creates Bloom filters for every instance of the data items. These filters are then used to retrieve relevant information from the linked open data source, which is later processed into additional columns in the target dataset. A case study on three real-world datasets, with Wikidata as the external data source, empirically validates the proposed method on both regression and classification tasks. The experimental results show that datasets augmented by our algorithm yield an average correlation improvement of 23.11% for the regression task and an ROC improvement of 86.50% for the classification task.
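The abstract does not say how the Bloom filters are built or how they are matched against Wikidata, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes one filter per row holding the tokens of that row's key value, and it replaces Wikidata with a small in-memory dictionary. The names `BloomFilter`, `augment`, and `external_facts` are hypothetical and introduced here for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int serves as the bit array

    def _positions(self, item):
        # Items are lowercased so that membership is case-insensitive.
        digest = hashlib.sha256(item.lower().encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def augment(rows, key, external_facts):
    """Return copies of rows with extra columns from external_facts.

    Assumption: external_facts maps an entity label to a dict of properties,
    standing in for a linked open data source such as Wikidata.
    """
    augmented = []
    for row in rows:
        value = str(row[key])
        row_filter = BloomFilter()  # one filter per data instance (assumption)
        for token in value.split():
            row_filter.add(token)

        new_row = dict(row)
        for label, properties in external_facts.items():
            # Cheap screening step: a Bloom filter can give false positives
            # but never false negatives.
            if all(token in row_filter for token in label.split()):
                # Exact comparison discards any false positives.
                if label.lower() == value.lower():
                    new_row.update(properties)
        augmented.append(new_row)
    return augmented


if __name__ == "__main__":
    rows = [{"city": "Bangkok"}, {"city": "Tokyo"}, {"city": "Atlantis"}]
    external_facts = {
        "Bangkok": {"country": "Thailand"},
        "Tokyo": {"country": "Japan"},
    }
    for row in augment(rows, "city", external_facts):
        print(row)
```

The filter here acts only as a fast pre-check before an exact comparison. A real pipeline would query Wikidata instead of a dictionary and would need a strategy for rows that receive no new columns.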
