Venue: AAAI Workshop

A Comparison of Event Models for Naive Bayes Text Classification



Abstract

Recent work in text classification has used two different first-order probabilistic models for classification, both of which make the naive Bayes assumption. Some use a multi-variate Bernoulli model, that is, a Bayesian network with no dependencies between words and binary word features (e.g. Larkey and Croft 1996; Koller and Sahami 1997). Others use a multinomial model, that is, a uni-gram language model with integer word counts (e.g. Lewis and Gale 1994; Mitchell 1997). This paper aims to clarify the confusion by describing the differences and details of these two models, and by empirically comparing their classification performance on five text corpora. We find that the multi-variate Bernoulli model performs well with small vocabulary sizes, but that the multinomial model usually performs even better at larger vocabulary sizes, providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.
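The two event models the abstract contrasts differ in how a document is assumed to be generated: the multinomial model draws words with replacement (so integer counts matter), while the multi-variate Bernoulli model flips one coin per vocabulary word (so only presence/absence matters, and absent words also contribute evidence). A minimal from-scratch sketch of both, with Laplace smoothing; the toy corpus, class labels, and function names are illustrative, not from the paper:

```python
import math
from collections import Counter

# Toy training corpus of (class, tokenized document) pairs -- illustrative only.
train = [
    ("sports", ["ball", "goal", "team", "goal"]),
    ("sports", ["team", "win", "ball"]),
    ("tech",   ["code", "bug", "code", "compile"]),
    ("tech",   ["bug", "fix", "code"]),
]
vocab = sorted({w for _, doc in train for w in doc})
classes = sorted({c for c, _ in train})
# Class priors are uniform in this toy set, so they are omitted from the scores.

def train_multinomial(data):
    """Multinomial event model: P(w|c) from total word occurrence counts."""
    counts = {c: Counter() for c in classes}
    for c, doc in data:
        counts[c].update(doc)                      # every occurrence counts
    logp = {}
    for c in classes:
        total = sum(counts[c].values())
        logp[c] = {w: math.log((counts[c][w] + 1) / (total + len(vocab)))
                   for w in vocab}                 # Laplace smoothing
    return logp

def train_bernoulli(data):
    """Multi-variate Bernoulli model: P(w present|c) from document frequencies."""
    docfreq = {c: Counter() for c in classes}
    ndocs = Counter(c for c, _ in data)
    for c, doc in data:
        docfreq[c].update(set(doc))                # at most once per document
    return {c: {w: (docfreq[c][w] + 1) / (ndocs[c] + 2) for w in vocab}
            for c in classes}

def predict_multinomial(logp, doc):
    # Sum log P(w|c) over every token occurrence; unknown words are skipped.
    scores = {c: sum(logp[c][w] for w in doc if w in logp[c]) for c in classes}
    return max(scores, key=scores.get)

def predict_bernoulli(p, doc):
    # Every vocabulary word contributes: log p if present, log (1 - p) if absent.
    present = set(doc)
    scores = {c: sum(math.log(p[c][w]) if w in present else math.log(1 - p[c][w])
                     for w in vocab)
              for c in classes}
    return max(scores, key=scores.get)
```

Note the structural difference the paper examines: `predict_bernoulli` iterates over the whole vocabulary (absent words matter), whereas `predict_multinomial` iterates only over the document's tokens and weights repeated words by their counts.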


