Detecting Mislabeled Data Using Supervised Machine Learning Techniques

Abstract

Many data sets, for instance those gathered during user experiments, are contaminated with noise. Some noise in the measured features is not much of a problem; it even improves the performance of many Machine Learning (ML) techniques. Noise in the labels (mislabeled data) is a different matter: label noise degrades the performance of all ML techniques. The research question addressed in this paper is to what extent mislabeled data can be detected using a committee of supervised Machine Learning models. The committee under consideration consists of a Bayesian model, a Random Forest, a Logistic classifier, a Neural Network, and a Support Vector Machine. This committee is applied to a given data set in several iterations of 5-fold cross-validation. If a data sample is misclassified by all committee members in all iterations (consensus), it is tagged as mislabeled. This approach was tested on the Iris plant data set, artificially contaminated with mislabeled data. For this data set, the precision of detecting mislabeled samples is 100% and the recall is approximately 5%. The approach was also tested on the Touch data set, a data set of naturalistic social touch gestures. This data set is known to contain mislabeled data, but the amount is unknown. For this data set, the proposed method achieved a precision of 70%, and for almost all other tagged samples the corresponding touch gesture deviated considerably from the prototypical touch gesture. Overall, the proposed method shows high potential for detecting mislabeled samples, but its precision on other data sets still needs to be investigated.
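The consensus rule described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the layout of the `predictions` dictionary, and the toy data are all assumptions made for illustration. The sketch takes held-out predictions as given (for example, from repeated 5-fold cross-validation of each committee member) and covers only the tagging step and the precision/recall computation.

```python
from typing import Dict, Hashable, List, Sequence, Set, Tuple


def consensus_mislabeled(
    given_labels: Sequence[Hashable],
    predictions: Dict[str, List[Sequence[Hashable]]],
) -> Set[int]:
    """Return indices misclassified by every model in every iteration.

    predictions[model_name][iteration][i] is the held-out prediction for
    sample i. This layout is hypothetical, not taken from the paper.
    """
    if not predictions or not all(predictions.values()):
        return set()  # no committee votes, so nothing can be tagged
    tagged = set(range(len(given_labels)))
    for runs in predictions.values():
        for run in runs:
            # Keep only samples this model also got wrong in this iteration.
            tagged = {i for i in tagged if run[i] != given_labels[i]}
    return tagged


def precision_recall(
    tagged: Set[int], truly_mislabeled: Set[int]
) -> Tuple[float, float]:
    """Precision and recall of the tagged set against known label flips."""
    hits = len(tagged & truly_mislabeled)
    precision = hits / len(tagged) if tagged else 0.0
    recall = hits / len(truly_mislabeled) if truly_mislabeled else 0.0
    return precision, recall


# Toy data (invented): 6 samples, two committee members, two iterations each.
given = ["a", "a", "b", "b", "a", "b"]
preds = {
    "model_1": [["a", "b", "b", "b", "b", "b"], ["a", "b", "b", "a", "b", "b"]],
    "model_2": [["a", "b", "b", "b", "b", "b"], ["a", "b", "b", "b", "a", "b"]],
}
tagged = consensus_mislabeled(given, preds)
precision, recall = precision_recall(tagged, truly_mislabeled={1, 4})
print(sorted(tagged), precision, recall)
```

In this toy example sample 4 escapes tagging because a single committee member classifies it correctly in one iteration. This shows why a strict consensus rule tends toward high precision and low recall, the pattern the abstract reports for the Iris data set.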

Bibliographic record

Source
International Conference on Human-Computer Interaction; International Conference on Augmented Cognition | 2017 | pp. 571-581 | 11 pages
Author
Mannes Poel
Original format
PDF
Keywords
Mislabeled data; Supervised Machine Learning
