Journal: 计算机应用研究 (Application Research of Computers)

Centroid-word based context topic model

     

Abstract

The latent Dirichlet allocation (LDA) topic model is an effective tool for processing unstructured documents. However, it rests on the bag-of-words (BOW) assumption, which treats each document as an unordered collection of words and considers neither the order among documents nor the order among words. To address this limitation and the limited accuracy of existing models, this paper proposes a centroid-word based context topic model. The underlying idea is that the topic of a word in a document is more closely related to the topics of the words near it than to those of distant words. When computing the topic distribution of each word, the model takes that word as the center, extends a window of several words before and after it, and then performs the computation on each window. This induces an order among the windows, and therefore a local order among the words. Because each word has a different context, the topic distribution of a word depends on its position in the document. Experiments show that the centroid-word based context topic model achieves higher accuracy and faster convergence on previously unseen datasets.
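The windowing step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the window size or the inference procedure, so the function name `context_windows`, the `radius` parameter, and the choice to truncate windows at document boundaries are all assumptions made for illustration only; this is not the authors' implementation, and it covers only window construction, not topic inference.

```python
from typing import List, Tuple


def context_windows(tokens: List[str], radius: int) -> List[Tuple[str, List[str]]]:
    """Build one context window per token, centered on that token.

    `radius` (a hypothetical parameter) is the number of words taken on each
    side of the center word. Windows are truncated at the document
    boundaries, which is an assumption; the abstract does not say how the
    edges are handled.
    """
    if radius < 0:
        raise ValueError("radius must be non-negative")
    windows = []
    for i, center in enumerate(tokens):
        start = max(0, i - radius)
        end = min(len(tokens), i + radius + 1)
        # The window keeps the original word order, so consecutive windows
        # overlap and preserve local ordering information.
        windows.append((center, tokens[start:end]))
    return windows


# Example document, already tokenized.
doc = ["topic", "models", "ignore", "word", "order"]
windows = context_windows(doc, radius=1)
for center, window in windows:
    print(center, window)
```

Because the same word at two different positions gets two different windows, any per-window computation yields a position-dependent result, which matches the abstract's claim that a word's topic distribution depends on where it occurs in the document.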
