Bayesian Classifier Modeling for Dirty Data

Abstract

Bayesian classifiers have proven effective in many practical applications. To train a Bayesian classifier, important parameters such as the prior and class-conditional probabilities need to be learned from datasets. In practice, datasets are prone to dirty (missing, erroneous, or duplicated) values, which severely affect model accuracy if no data cleaning is performed. However, cleaning the whole dataset is prohibitively laborious and thus infeasible even for medium-sized datasets. To this end, we propose to induce Bayes models by cleaning only small samples of the dataset. We derive confidence intervals as a function of the sample size after data cleaning; in this way, the posterior probability is guaranteed to fall within the estimated confidence interval with constant probability. We then design two strategies to compare posterior probability intervals when they overlap. An extension to the semi-naive Bayes method is also addressed. Experimental results suggest that cleaning only a small number of samples can train satisfactory Bayesian models, offering a significant improvement in cost over cleaning all of the data and significant improvements in precision, recall, and F-measure over cleaning none of the data.
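
The abstract does not reproduce the derivation, but a standard Hoeffding-style bound illustrates how such intervals shrink with the cleaned-sample size (an assumed instantiation for illustration; the paper's actual bounds may differ). For a probability $p$ estimated by the empirical frequency $\hat{p}$ over $n$ cleaned samples,

$$\Pr\bigl(|\hat{p} - p| \ge \varepsilon\bigr) \le 2e^{-2n\varepsilon^{2}}, \qquad \text{so with probability at least } 1-\delta:\quad p \in [\hat{p}-\varepsilon,\ \hat{p}+\varepsilon], \quad \varepsilon = \sqrt{\frac{\ln(2/\delta)}{2n}}.$$

The interval half-width thus decays as $O(1/\sqrt{n})$, which is what makes cleaning only a small sample sufficient for bounded posterior estimates.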
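To make the pipeline concrete, here is a minimal Python sketch of a naive Bayes classifier trained on a small cleaned sample, in which every estimated probability carries a Hoeffding interval and prediction compares posterior intervals, falling back to interval midpoints when they overlap. Everything here (the class IntervalNaiveBayes, the interval-propagation rule, the midpoint fallback) is an illustrative assumption, not the authors' implementation or their two comparison strategies.

import math
from collections import Counter, defaultdict

def hoeffding_eps(n, delta=0.05):
    # Half-width of a (1 - delta) Hoeffding interval for a frequency
    # estimated from n samples; shrinks as O(1 / sqrt(n)).
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

class IntervalNaiveBayes:
    """Naive Bayes trained on a small cleaned sample; each estimated
    probability carries a Hoeffding interval, and prediction compares
    posterior intervals. A sketch, not the paper's implementation."""

    def fit(self, X, y, delta=0.05):
        n = len(y)
        self.eps = hoeffding_eps(n, delta)
        self.classes = sorted(set(y))
        self.class_n = Counter(y)
        self.prior = {c: cnt / n for c, cnt in self.class_n.items()}
        # cond[c][j] counts values of feature j among class-c samples.
        self.cond = {c: defaultdict(Counter) for c in self.classes}
        for xi, yi in zip(X, y):
            for j, v in enumerate(xi):
                self.cond[yi][j][v] += 1
        return self

    def _cond_prob(self, c, j, v):
        # Add-one smoothing over the values observed for this feature.
        counts = self.cond[c][j]
        return (counts[v] + 1) / (self.class_n[c] + len(counts) + 1)

    def posterior_interval(self, x, c):
        # Bounds on the unnormalised posterior obtained by shifting every
        # estimated factor by +/- eps and clipping to (0, 1].
        lo = hi = 1.0
        factors = [self.prior[c]] + [self._cond_prob(c, j, v)
                                     for j, v in enumerate(x)]
        for p in factors:
            lo *= max(p - self.eps, 1e-12)
            hi *= min(p + self.eps, 1.0)
        return lo, hi

    def predict(self, x):
        # If one class's interval dominates (no overlap), pick it;
        # otherwise fall back to comparing interval midpoints.
        ivals = {c: self.posterior_interval(x, c) for c in self.classes}
        best = max(ivals, key=lambda c: ivals[c][0])
        if all(ivals[best][0] >= ivals[c][1]
               for c in self.classes if c != best):
            return best  # intervals separated: confident decision
        return max(ivals, key=lambda c: sum(ivals[c]) / 2)

For example, with only four cleaned samples the intervals are very wide (eps is roughly 0.68 at delta = 0.05), so prediction falls back to midpoints; as the cleaned sample grows, the intervals tighten and separate, and the confident branch takes over:

X = [("sunny", "hot"), ("rainy", "cool"), ("sunny", "cool"), ("rainy", "hot")]
y = ["no", "yes", "yes", "no"]
print(IntervalNaiveBayes().fit(X, y).predict(("sunny", "cool")))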
