
Bayesian Classifier Modeling for Dirty Data


Abstract

Bayesian classifiers have proven effective in many practical applications. To train a Bayesian classifier, important parameters such as the prior and class-conditional probabilities need to be learned from datasets. In practice, datasets are prone to errors due to dirty (missing, erroneous, or duplicated) values, which severely affect model accuracy if no data cleaning is performed. However, cleaning the whole dataset is prohibitively laborious and thus infeasible even for medium-sized datasets. To this end, we propose to induce Bayes models by cleaning only small samples of the dataset. We derive confidence intervals as a function of sample size after data cleaning. In this way, the posterior probability is guaranteed to fall into the estimated confidence intervals with constant probability. We then design two strategies to compare posterior probability intervals when they overlap. An extension to the semi-naive Bayes method is also addressed. Experimental results suggest that cleaning only a small number of samples can train satisfactory Bayesian models, offering significant improvement in cost over cleaning all of the data and significant improvement in precision, recall, and F-measure over cleaning none of the data.
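The idea of attaching confidence intervals to probabilities estimated from a small cleaned sample, and predicting by comparing the resulting posterior intervals, can be sketched as follows. This is a minimal illustration, not the paper's method: it assumes Hoeffding-style intervals of half-width sqrt(ln(2/δ)/2n) for all estimates, uses no smoothing, and substitutes a simple midpoint tie-break where the paper proposes two dedicated comparison strategies.

```python
import math
from collections import Counter, defaultdict

def hoeffding_halfwidth(n, delta=0.05):
    # With probability >= 1 - delta, the empirical mean of n i.i.d. [0,1]
    # samples deviates from the true mean by at most this amount.
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def train_intervals(samples, labels, delta=0.05):
    """Estimate naive-Bayes parameters from a small *cleaned* sample and
    attach confidence intervals to every probability estimate."""
    n = len(samples)
    eps = hoeffding_halfwidth(n, delta)
    prior = Counter(labels)
    cond = defaultdict(Counter)  # cond[class][(feature_idx, value)] -> count
    for x, y in zip(samples, labels):
        for i, v in enumerate(x):
            cond[y][(i, v)] += 1
    model = {}
    for c, nc in prior.items():
        p = nc / n
        model[c] = {
            "prior": (max(0.0, p - eps), min(1.0, p + eps)),
            "cond": {k: (max(0.0, cnt / nc - eps), min(1.0, cnt / nc + eps))
                     for k, cnt in cond[c].items()},
        }
    return model

def score_intervals(model, x):
    """Lower/upper bounds on the (unnormalised) posterior of each class."""
    out = {}
    for c, params in model.items():
        lo, hi = params["prior"]
        for i, v in enumerate(x):
            # Unseen feature values get the vacuous bound [0, 1].
            cl, ch = params["cond"].get((i, v), (0.0, 1.0))
            lo, hi = lo * cl, hi * ch
        out[c] = (lo, hi)
    return out

def predict(model, x):
    """If one class's posterior interval dominates all others (no overlap),
    pick it; otherwise fall back to comparing interval midpoints."""
    scores = score_intervals(model, x)
    best = max(scores, key=lambda c: scores[c][0])
    if all(scores[best][0] >= hi for c, (lo, hi) in scores.items() if c != best):
        return best  # interval dominance: no overlap with any rival
    return max(scores, key=lambda c: sum(scores[c]) / 2.0)
```

As the sample size n grows (i.e., as more records are cleaned), the half-width shrinks at rate O(1/sqrt(n)), so the posterior intervals tighten and interval dominance decides more predictions without the fallback.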

