Computer Speech and Language

Unsupervised language model adaptation using LDA-based mixture models and latent semantic marginals

Abstract

In this paper, we present unsupervised language model (LM) adaptation approaches using latent Dirichlet allocation (LDA) and latent semantic marginals (LSM). The LSM is the unigram probability distribution over words that is computed from LDA-adapted unigram models. The LDA model is used to extract topic information from a training corpus in an unsupervised manner. It yields a document-topic matrix that records, for each document, the number of words assigned to each topic. A hard-clustering method is applied to this document-topic matrix to form topics. An adapted model is created as a weighted combination of the n-gram topic models. The stand-alone adapted model outperforms the background model, and interpolating the background and adapted models gives a further improvement. We then modify the above models using the LSM. The LSM is used to form a new adapted model through the minimum discriminant information (MDI) adaptation approach called unigram scaling, which minimizes the distance between the new adapted model and the original model. Unigram scaling of the adapted model using the LSM yields better results than a conventional unigram scaling approach. Unigram scaling of the interpolated background and adapted model using the LSM outperforms the background model, the unigram scaling of the background model, the unigram scaling of the adapted model, and the interpolation of the background and adapted models. We perform experiments on the '87-89 Wall Street Journal (WSJ) corpus with a multi-pass continuous speech recognition (CSR) system: in the first pass, the background n-gram language model is used for lattice generation; in the second pass, the LM adaptation approaches are applied for lattice rescoring.
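The abstract names the two combination steps and the MDI unigram-scaling step without giving their formulas. The LaTeX block below sketches the standard forms these techniques take in the LM-adaptation literature; the mixture weights \(\lambda_k\), interpolation weight \(\mu\), and scaling exponent \(\beta\) are generic tuning parameters, not values taken from the paper.

```latex
% Adapted model: weighted combination of K topic n-gram models
P_{\mathrm{adapt}}(w \mid h) = \sum_{k=1}^{K} \lambda_k \, P_k(w \mid h),
  \qquad \sum_{k=1}^{K} \lambda_k = 1

% Interpolation with the background model
P_{\mathrm{interp}}(w \mid h) =
  \mu \, P_{\mathrm{bg}}(w \mid h) + (1 - \mu)\, P_{\mathrm{adapt}}(w \mid h)

% MDI unigram scaling toward the LSM unigram distribution P_lsm(w):
% each word's probability is rescaled, then renormalized per history h
P_{\mathrm{scaled}}(w \mid h) = \frac{\alpha(w)\, P(w \mid h)}{Z(h)},
  \qquad \alpha(w) = \left( \frac{P_{\mathrm{lsm}}(w)}{P(w)} \right)^{\beta}
```

Here \(Z(h) = \sum_{w'} \alpha(w')\, P(w' \mid h)\) normalizes the scaled distribution over the vocabulary.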
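A minimal runnable Python sketch of the same pipeline on a toy four-word vocabulary, assuming the formulas above; the helper names (mixture_adapt, unigram_scale), the toy distributions, and all weight values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mixture_adapt(topic_dists, weights):
    """Adapted model for one history h: sum_k lambda_k * P_k(w|h)."""
    return np.average(topic_dists, axis=0, weights=weights)

def unigram_scale(p_cond, p_bg_unigram, p_lsm, beta=0.5):
    """MDI-style unigram scaling: push P(w|h) toward the LSM unigram
    marginal, then renormalize over the vocabulary."""
    alpha = (p_lsm / p_bg_unigram) ** beta  # per-word scaling factor
    scaled = alpha * p_cond
    return scaled / scaled.sum()            # Z(h) normalization

# Two topic n-gram distributions for one history h (toy values).
topic_dists = np.array([[0.5, 0.2, 0.2, 0.1],
                        [0.1, 0.4, 0.3, 0.2]])
weights = np.array([0.7, 0.3])                  # mixture weights (assumed)
p_bg = np.array([0.4, 0.3, 0.2, 0.1])           # background P(w|h)
p_bg_uni = np.array([0.25, 0.25, 0.25, 0.25])   # background unigram P(w)
p_lsm = np.array([0.4, 0.3, 0.2, 0.1])          # LSM unigram (assumed)

p_adapt = mixture_adapt(topic_dists, weights)
p_interp = 0.5 * p_bg + 0.5 * p_adapt           # interpolation, mu = 0.5
p_final = unigram_scale(p_interp, p_bg_uni, p_lsm)
print(p_final)
```

In a multi-pass CSR setup, probabilities like p_final would replace the background LM scores when rescoring the first-pass lattices.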