Open Journal of Statistics

A Fully Bayesian Sparse Probit Model for Text Categorization



Abstract

A common problem in processing data sets with a large number of covariates relative to a small sample size ("fat" data sets) is estimating the parameters associated with each covariate. When the number of covariates far exceeds the number of samples, parameter estimation becomes very difficult. Researchers in many fields, such as text categorization, face the burden of finding and estimating important covariates without overfitting the model. In this study, we developed a Sparse Probit Bayesian Model (SPBM) based on Gibbs sampling that uses double exponential (Laplace) priors to induce shrinkage and reduce the number of covariates in the model. The method was evaluated on ten domains, such as mathematics, whose corpora were downloaded from Wikipedia. From the downloaded corpora, we built the TF-IDF matrix for all domains and randomly divided the data set into training and test groups of size 300. To make the evaluation more robust, we performed 50 re-samplings of the training/test split. The model was implemented in R; the Gibbs sampler ran for 60,000 iterations, with the first 20,000 discarded as burn-in. We classified the training and test groups by computing P(y_i = 1) and, following [1] [2], used a threshold of 0.5 as the decision rule. The model's performance was compared to Support Vector Machines (SVM) using average sensitivity and specificity across the 50 runs. The SPBM achieved high classification accuracy and outperformed SVM in almost all domains analyzed.
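The decision rule and evaluation described in the abstract — classifying a document as positive when P(y_i = 1) = Φ(x_i′β) exceeds 0.5, then scoring each run by sensitivity and specificity — can be sketched as follows. This is a minimal illustrative Python sketch (the paper's own implementation is in R); the function names are hypothetical, and the Gibbs sampler that actually draws β under the Laplace prior is omitted.

```python
import math

def probit_prob(x, beta):
    """P(y = 1 | x) = Phi(x' beta): the standard normal CDF of the linear score."""
    score = sum(xi * bi for xi, bi in zip(x, beta))
    return 0.5 * (1.0 + math.erf(score / math.sqrt(2.0)))

def classify(x, beta, threshold=0.5):
    """Decision rule from the abstract: predict 1 when P(y = 1) >= 0.5."""
    return 1 if probit_prob(x, beta) >= threshold else 0

def sensitivity_specificity(y_true, y_pred):
    """Per-run evaluation metrics used to compare SPBM against SVM."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec
```

In the paper, β would be a posterior summary from the Gibbs sampler (after burn-in) rather than a fixed vector, and x would be a row of the TF-IDF matrix; averaging the sensitivity and specificity over the 50 re-sampled splits gives the reported comparison figures.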
