In most probabilistic topic models, a document is viewed as a collection of tokens and each token is a variable whose values are all the words in a vocabulary. One exception is hierarchical latent tree models (HLTMs), where a document is viewed as a binary vector over the vocabulary and each word is regarded as a binary variable. The use of word variables allows the detection and representation of patterns of word co-occurrences and co-occurrences of those patterns qualitatively using multiple levels of latent variables, and naturally leads to a method for hierarchical topic detection. In this paper, we assume that an HLTM has been learned from binary data and we extend it to take word frequencies into consideration. The idea is to replace each binary word variable with a real-valued variable that represents the relative frequency of the word in a document. A document generation process is proposed and an algorithm is given for estimating the model parameters by inverting the generation process. Empirical results show that our method significantly outperforms the commonly-used LDA-based methods for hierarchical topic detection, in terms of model quality and meaningfulness of topics and topic hierarchies.
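The core change described above, going from a binary word indicator to the word's relative frequency in a document, can be illustrated with a minimal sketch. This is not code from the paper; the function name and vocabulary handling are assumptions made for illustration only.

```python
from collections import Counter

def relative_frequencies(tokens, vocabulary):
    """Map a tokenized document to per-word relative frequencies,
    replacing the binary present/absent indicator with a real value
    in [0, 1] for each vocabulary word (illustrative sketch)."""
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values())
    if total == 0:
        # Document shares no words with the vocabulary.
        return {w: 0.0 for w in vocabulary}
    return {w: counts[w] / total for w in vocabulary}

doc = ["topic", "model", "topic", "tree"]
vocab = ["topic", "model", "tree", "latent"]
freqs = relative_frequencies(doc, vocab)
# "topic" occurs twice among four in-vocabulary tokens -> 0.5;
# "latent" does not occur -> 0.0.
```

Under the binary view, "topic" and "tree" would both simply be 1 for this document; the real-valued view retains the information that "topic" occurs twice as often.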