
Bayesian Classifier Modeling for Dirty Data


Abstract

Bayesian classifiers have proven effective in many practical applications. To train a Bayesian classifier, important parameters such as the prior and class-conditional probabilities need to be learned from datasets. In practice, datasets are prone to errors due to dirty (missing, erroneous, or duplicated) values, which severely affect model accuracy if no data cleaning is performed. However, cleaning the whole dataset is prohibitively laborious and thus infeasible even for medium-sized datasets. To this end, we propose to induce Bayes models by cleaning only small samples of the dataset. We derive confidence intervals as a function of sample size after data cleaning. In this way, the posterior probability is guaranteed to fall into the estimated confidence intervals with constant probability. We then design two strategies to compare posterior probability intervals when they overlap. An extension to the semi-naive Bayes method is also addressed. Experimental results suggest that cleaning only a small number of samples can train satisfactory Bayesian models, offering significant improvement in cost over cleaning all of the data and significant improvement in precision, recall, and F-measure over cleaning none of the data.
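The idea of attaching confidence intervals to probabilities estimated from a small cleaned sample, and predicting by comparing the resulting posterior intervals, can be sketched as follows. This is a minimal illustration, not the paper's method: it assumes Hoeffding-style intervals of half-width sqrt(ln(2/δ)/2n) for all estimates, uses no smoothing, and substitutes a simple midpoint tie-break where the paper proposes two dedicated comparison strategies.

```python
import math
from collections import Counter, defaultdict

def hoeffding_halfwidth(n, delta=0.05):
    # With probability >= 1 - delta, the empirical mean of n i.i.d. [0,1]
    # samples deviates from the true mean by at most this amount.
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def train_intervals(samples, labels, delta=0.05):
    """Estimate naive-Bayes parameters from a small *cleaned* sample and
    attach confidence intervals to every probability estimate."""
    n = len(samples)
    eps = hoeffding_halfwidth(n, delta)
    prior = Counter(labels)
    cond = defaultdict(Counter)  # cond[class][(feature_idx, value)] -> count
    for x, y in zip(samples, labels):
        for i, v in enumerate(x):
            cond[y][(i, v)] += 1
    model = {}
    for c, nc in prior.items():
        p = nc / n
        model[c] = {
            "prior": (max(0.0, p - eps), min(1.0, p + eps)),
            "cond": {k: (max(0.0, cnt / nc - eps), min(1.0, cnt / nc + eps))
                     for k, cnt in cond[c].items()},
        }
    return model

def score_intervals(model, x):
    """Lower/upper bounds on the (unnormalised) posterior of each class."""
    out = {}
    for c, params in model.items():
        lo, hi = params["prior"]
        for i, v in enumerate(x):
            # Unseen feature values get the vacuous bound [0, 1].
            cl, ch = params["cond"].get((i, v), (0.0, 1.0))
            lo, hi = lo * cl, hi * ch
        out[c] = (lo, hi)
    return out

def predict(model, x):
    """If one class's posterior interval dominates all others (no overlap),
    pick it; otherwise fall back to comparing interval midpoints."""
    scores = score_intervals(model, x)
    best = max(scores, key=lambda c: scores[c][0])
    if all(scores[best][0] >= hi for c, (lo, hi) in scores.items() if c != best):
        return best  # interval dominance: no overlap with any rival
    return max(scores, key=lambda c: sum(scores[c]) / 2.0)
```

As the sample size n grows (i.e., as more records are cleaned), the half-width shrinks at rate O(1/sqrt(n)), so the posterior intervals tighten and interval dominance decides more predictions without the fallback.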

