DATA++: An Automated Tool for Intelligent Data Augmentation Using Wikidata

Abstract

Technology now shapes many people's lives, and artificial intelligence is among its most influential elements. Creative feature engineering is an important part of machine learning methodology: it transforms existing data, for example by modifying its dimensions, so that models can use it more effectively. Pulling useful information from external sources and combining it with existing data, however, is cumbersome, since data engineers must manually find external data sources and process them. The ability to modify and enrich existing data automatically using external open data sources could therefore prove crucial to data engineers and scientists looking to enrich their datasets. In this paper, we propose a method that automatically augments a given structured dataset by inferring relevant dimensions from an external data source with respect to the target attribute. Specifically, our proposed algorithm first creates Bloom filters for every instance of the data items. These filters are then used to retrieve relevant information from the linked open data source, which is later processed into additional columns in the target dataset. A case study of three real-world datasets, with Wikidata as the external data source, empirically validates the proposed method on both regression and classification tasks. The experimental results show that datasets augmented by our proposed algorithm yield a correlation improvement of 23.11% on average for the regression task and an ROC improvement of 86.50% for the classification task.
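
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea: a Bloom filter built from the dataset's entity labels screens external records, and matching records are turned into new columns. The `BloomFilter` class, the `augment` function, the filter sizes, and the in-memory `external_records` list (standing in for entities retrieved from Wikidata) are all assumptions made for illustration, not the authors' algorithm. The sketch also uses a single filter over all labels, which is a simplification of the per-instance filters the abstract mentions.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def augment(rows, key, external_records):
    """Add external properties as new columns to rows whose key matches.

    rows: list of dicts (the target dataset).
    external_records: iterable of (label, properties_dict) pairs, a
    hypothetical in-memory stand-in for entities fetched from Wikidata.
    """
    wanted = BloomFilter()
    for row in rows:
        wanted.add(row[key].lower())

    by_label = {row[key].lower(): row for row in rows}
    for label, props in external_records:
        label = label.lower()
        if label not in wanted:      # cheap screen: definitely not in the dataset
            continue
        row = by_label.get(label)    # exact check removes false positives
        if row is not None:
            for name, value in props.items():
                row.setdefault(name, value)  # never overwrite existing columns
    return rows
```

In this toy setting the exact-match dictionary makes the filter redundant; the filter only pays off when the label set must be summarized compactly, for example when screening a large external dump before any exact lookup.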

Bibliographic details

Source
International Joint Conference on Computer Science and Software Engineering, 2019, pp. 91-96 (6 pages)
Authors
Waran Taveekarn; Chatchanin Yimudom; Supisara Sukkanta; Steven Lynden; Wudhichart Sawangphol; Suppawong Tuarob
Original format: PDF
Keywords
Big Data; Task analysis; Machine learning; Predictive models; Tools; Prediction algorithms

Similar documents

1. A Wikidata-based tool for building and visualising narratives [J]. Daniele Metilli, Valentina Bartalesi, Carlo Meghini. International Journal on Digital Libraries. 2019, No. 4
2. Data entry quality of double data entry vs automated form processing technologies: A cohort study validation of optical mark recognition and intelligent character recognition in a clinical setting [J]. Aksel Paulsen, Knut Harboe, Ingvild Dalen. Health Science Reports. 2020, No. 4
3. Automated trend analysis of proteomics data using an intelligent data mining architecture [J]. James Malone, Ken McGarry, Chris Bowerman. Expert Systems with Applications. 2006, No. 1
4. DATA++: An Automated Tool for Intelligent Data Augmentation Using Wikidata [C]. Waran Taveekarn, Chatchanin Yimudom, Supisara Sukkanta. International Joint Conference on Computer Science and Software Engineering. 2019
5. Automated and Intelligent Programming of CNC Machine Tools = Samodejno in inteligentno programiranje CNC strojev [D]. Gjelaj, Afrim. 2015
6. Data entry quality of double data entry vs automated form processing technologies: A cohort study validation of optical mark recognition and intelligent character recognition in a clinical setting [O]. Aksel Paulsen, Knut Harboe, Ingvild Dalen. 2020
7. Building automated vandalism detection tools for Wikidata [O]. Sarabadani, Amir, Halfaker, Aaron, Taraborelli, Dario. 2017
