Probabilistic Topic Modeling and Classification Probabilistic PCA for Text Corpora.



Abstract

Topic modeling is one of the most common tools for analyzing large volumes of unlabeled documents, which are usually represented as bags of words. This thesis first discusses the connections between the exchangeability property of bag-of-words, popular topic modeling algorithms, and the de Finetti-Hewitt-Savage theorem. We show that these algorithms are special cases of this theorem and that exchangeability of words, rather than independence of words, is the sufficient condition for applying them. The work then focuses on latent Dirichlet allocation (LDA) because of its higher modeling capability. An investigation of asymmetric priors for LDA and a derivation of the per-document topic distribution for unseen documents are also presented. Since topics are usually represented as multinomial distributions over words, their semantic meaning is not easily understood, especially by people unfamiliar with the background of the corpus under study. To address this problem, automatic topic labeling is proposed to generate understandable topic labels.

Apart from its text, a corpus is usually accompanied by meta-information such as author name, date, and category. Integrating this information with the text during topic modeling not only enables better topic analysis but also reveals more information. For instance, an author's interests can be identified if occurrences of his or her name are tightly coupled to certain words. Based on LDA and the author-topic model, we propose a Bayesian model with Dirichlet priors that combines text and author information to identify the topics associated with the corpus and the interests associated with the authors. With both the topics and the interests, the generalization of the model is significantly improved. We also propose a composite model that combines the identified topics and interests so that the overall composite topics of a corpus can be derived. These composite topics have the desirable property that the correlation between them is lower, so they can represent more diverse aspects of the corpus.

Text corpus analysis and processing are often performed in low-dimensional spaces rather than in the high-dimensional spaces formed by bag-of-words. Dimensionality reduction can be achieved with principal component analysis (PCA) or other algorithms. However, most of these algorithms are unsupervised, and complementary information such as document labels is often ignored. Although some supervised dimensionality reduction algorithms exist, such as supervised probabilistic PCA, they treat labels as real numbers rather than as nominal categories. We propose classification probabilistic PCA (CPPCA) to incorporate the label information of documents, with labels treated as categories. Documents can be projected into a lower-dimensional space in which variances and labels are considered simultaneously. A semi-supervised version of this algorithm was applied to domain adaptation problems, and experimental results show that CPPCA performs significantly better than unsupervised and supervised probabilistic PCA.
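
The distinction the abstract draws between exchangeability and independence of words can be illustrated with a small sketch. The code below is not taken from the thesis; it assumes a toy Dirichlet-multinomial (Polya urn) word model with a hypothetical three-word vocabulary and unit pseudo-counts. It checks that every ordering of a short document has the same joint probability (exchangeability) while the words are still not independent.

```python
from fractions import Fraction
from itertools import permutations


def sequence_probability(words, alpha):
    """Joint probability of a word sequence under a Dirichlet-multinomial
    (Polya urn) model. `alpha` maps each vocabulary word to its Dirichlet
    pseudo-count (illustrative values, not from the thesis)."""
    counts = {w: 0 for w in alpha}
    total_alpha = sum(alpha.values())
    prob = Fraction(1)
    for i, w in enumerate(words):
        # Predictive probability of the next word given the words seen so far.
        prob *= Fraction(alpha[w] + counts[w], total_alpha + i)
        counts[w] += 1
    return prob


# Hypothetical toy vocabulary with a symmetric prior.
alpha = {"topic": 1, "model": 1, "corpus": 1}
doc = ("topic", "topic", "model")

# Exchangeable: all orderings of the document share one joint probability.
ordering_probs = {sequence_probability(p, alpha) for p in permutations(doc)}
exchangeable = len(ordering_probs) == 1

# Not independent: P(topic, topic) differs from P(topic) * P(topic).
p_joint = sequence_probability(("topic", "topic"), alpha)
p_single = sequence_probability(("topic",), alpha)
independent = p_joint == p_single ** 2

print("exchangeable:", exchangeable)
print("independent:", independent)
```

Exact rational arithmetic via `Fraction` is used so that the equality checks are not affected by floating-point rounding.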

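Bibliographic record

  • Author

    Cheng, Chi Wa

  • Author affiliation

    Hong Kong Baptist University (Hong Kong)

  • Degree-granting institution: Hong Kong Baptist University (Hong Kong)
  • Subject: Computer Science; Statistics
  • Degree: Ph.D.
  • Year: 2011
  • Pagination: 117 p.
  • Total pages: 117
  • Original format: PDF
  • Language of text: English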

  • Date added to database: 2022-08-17 11:44:33
