Journal: Data Technologies and Applications

Data cleaning issues in class imbalanced datasets: instance selection and missing values imputation for one-class classifiers



Abstract

Purpose: Class imbalance learning, which arises in many domain problem datasets, is an important research topic in data mining and machine learning. One-class classification techniques, which aim to identify anomalies (the minority class) against normal data (the majority class), are one representative solution for class imbalanced datasets. Since one-class classifiers are trained using only normal data to create a decision boundary for later anomaly detection, the quality of the training set, i.e. the majority class, is a key factor affecting the performance of one-class classifiers.

Design/methodology/approach: This paper focuses on two data cleaning or preprocessing methods for class imbalanced datasets. The first method examines whether performing instance selection to remove noisy data from the majority class can improve the performance of one-class classifiers. The second method combines instance selection with missing value imputation, where the latter is used to handle incomplete datasets that contain missing values.

Findings: The experimental results, based on 44 class imbalanced datasets, three instance selection algorithms (IB3, DROP3 and the GA), the CART decision tree for missing value imputation, and three one-class classifiers (OCSVM, IFOREST and LOF), show that if the instance selection algorithm is carefully chosen, this step improves the quality of the training data and allows one-class classifiers to outperform baselines trained without instance selection. Moreover, when class imbalanced datasets contain missing values, combining missing value imputation and instance selection, regardless of which step is performed first, can maintain data quality similar to that of datasets without missing values.

Originality/value: The novelty of this paper is to investigate the effect of performing instance selection on the performance of one-class classifiers, which has not been done before. Moreover, this study is the first attempt to consider the scenario in which missing values exist in the training set used for training one-class classifiers. In this case, performing missing value imputation and instance selection in different orders is compared.
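The preprocessing pipeline described above can be illustrated with a minimal sketch. Note that IB3, DROP3 and GA-based instance selection are not available in scikit-learn, so an IsolationForest inlier filter stands in for the instance selection step; the data here is synthetic, and this is not the paper's actual experimental setup.

```python
# Sketch: impute missing values in the majority (normal) class with a
# CART-style regressor, filter out noisy instances, then train a
# one-class classifier on the cleaned training set.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.svm import OneClassSVM
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(300, 4))      # majority (normal) class
normal[rng.random(normal.shape) < 0.05] = np.nan  # inject ~5% missing values

# Step 1: missing value imputation using a decision tree (CART) regressor.
imputer = IterativeImputer(estimator=DecisionTreeRegressor(max_depth=5),
                           random_state=0)
clean = imputer.fit_transform(normal)

# Step 2: instance selection -- keep only points the IsolationForest
# labels as inliers (stand-in for IB3/DROP3/GA noise filtering).
keep = IsolationForest(random_state=0).fit_predict(clean) == 1
selected = clean[keep]

# Step 3: train a one-class classifier on the cleaned majority class,
# then check that clearly anomalous points are flagged (-1).
ocsvm = OneClassSVM(gamma="auto").fit(selected)
anomalies = rng.normal(6.0, 1.0, size=(20, 4))    # synthetic minority class
flagged = (ocsvm.predict(anomalies) == -1).mean()
print(f"fraction of anomalies detected: {flagged:.2f}")
```

The abstract also compares running imputation before or after instance selection; in this sketch the two steps can simply be swapped, since both operate on the same training matrix.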

