Conference: International Joint Conference on Artificial Intelligence

Tag-Weighted Topic Model for Mining Semi-Structured Documents


Abstract

Over the last decade, latent Dirichlet allocation (LDA) has been used successfully to discover the statistical distribution of topics over an unstructured text corpus. Meanwhile, as the Internet has evolved, more and more document data come with rich human-provided tag information; such data are called semi-structured data. Semi-structured data contain both unstructured data (e.g., plain text) and metadata, such as papers with authors and web pages with tags. In general, different tags in a document play different roles, each with its own weight. Modeling such semi-structured documents is nontrivial. In this paper, we propose a novel method to model tagged documents with a topic model, called the Tag-Weighted Topic Model (TWTM). TWTM is a framework that leverages the tags in each document to infer the topic components of the documents. This allows it not only to learn document-topic distributions, but also to infer tag-topic distributions for text mining (e.g., classification, clustering, and recommendation). Moreover, TWTM automatically infers the probabilistic weights of tags for each document. We present an efficient variational inference method with an EM algorithm for estimating the model parameters. The experimental results show that our TWTM approach outperforms the baseline algorithms on three corpora in document modeling and text classification.
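
The abstract does not give the model equations, so the following is only a minimal sketch of the general idea it describes: each tag has its own distribution over topics, each document assigns weights to its tags, and the document's topic proportions can be read as a weighted mixture of its tags' topic distributions. The tag names, the numbers, and the helper functions `normalize` and `document_topic_mixture` are hypothetical and chosen for illustration. The weights are fixed by hand here, whereas the abstract says TWTM infers them with variational EM.

```python
def normalize(values):
    """Scale non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def document_topic_mixture(tag_weights, tag_topic):
    """Combine per-tag topic distributions using per-document tag weights.

    tag_weights: dict mapping tag -> weight (weights sum to 1)
    tag_topic:   dict mapping tag -> list of topic probabilities
    """
    num_topics = len(next(iter(tag_topic.values())))
    mixture = [0.0] * num_topics
    for tag, weight in tag_weights.items():
        for k, prob in enumerate(tag_topic[tag]):
            mixture[k] += weight * prob
    return mixture


# Hypothetical tag-topic distributions over three topics (assumed values).
tag_topic = {
    "author:smith": [0.7, 0.2, 0.1],
    "venue:ijcai": [0.1, 0.8, 0.1],
    "keyword:lda": [0.2, 0.2, 0.6],
}

# One document carrying two tags with hand-picked, unnormalized importance.
raw_weights = {"author:smith": 3.0, "keyword:lda": 1.0}
tags = list(raw_weights)
weights = dict(zip(tags, normalize([raw_weights[t] for t in tags])))

doc_topics = document_topic_mixture(weights, tag_topic)
print(doc_topics)
```

Because the tag weights sum to 1 and every tag-topic row is a probability distribution, the resulting mixture is also a valid distribution over topics.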
