Modeling dependencies in natural languages with latent variables


Abstract

In this thesis, we investigate the use of latent variables to model complex dependencies in natural languages. Traditional models, which have a fixed parameterization, often make strong independence assumptions that lead to poor performance. This problem is often addressed by incorporating additional dependencies into the model (e.g., using higher-order N-grams for language modeling). These added dependencies can increase data sparsity and/or require expert knowledge, together with trial and error, to identify and incorporate the most important dependencies (as in lexicalized parsing models). Traditional models, when developed for a particular genre, domain, or language, are also often difficult to adapt to another.

In contrast, previous work has shown that latent variable models, which automatically learn dependencies in a data-driven way, are able to flexibly adjust the number of parameters based on the type and the amount of training data available. We have created several different types of latent variable models for a diverse set of natural language processing applications, including novel models for part-of-speech tagging, language modeling, and machine translation, and an improved model for parsing. These models perform significantly better than traditional models. We have also created and evaluated three different methods for improving the performance of latent variable models. While these methods can be applied to any of our applications, we focus our experiments on parsing.

The first method involves self-training: we train models using a combination of gold standard training data and a large amount of automatically labeled training data. We conclude from a series of experiments that latent variable models benefit much more from self-training than conventional models, apparently due to their flexibility to adjust their parameterization to learn more accurate models from the additional automatically labeled training data.

The second method takes advantage of the variability among latent variable models to combine multiple models for enhanced performance. We investigate several different training protocols to combine self-training with model combination. We conclude that these two techniques are complementary and can be effectively combined to train very high quality parsing models.

The third method replaces the generative multinomial lexical model of latent variable grammars with a feature-rich log-linear lexical model to provide a principled solution to address data sparsity, handle out-of-vocabulary words, and exploit overlapping features during model induction. We conclude from experiments that the resulting grammars are able to effectively parse three different languages.

This work contributes to natural language processing by creating flexible and effective latent variable models for several different languages. Our investigation of self-training, model combination, and log-linear models also provides insights into the effective application of these machine learning techniques to other disciplines.
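As a concrete illustration of the self-training protocol described in the abstract above, the following is a minimal runnable sketch in Python. A scikit-learn logistic regression classifier and synthetic data stand in for the latent variable parser and treebank data; both are assumptions made purely for demonstration, not the dissertation's actual setup.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Small "gold standard" labeled set and a much larger unlabeled pool
    # (synthetic stand-ins for treebank trees and raw text).
    X_gold = rng.normal(size=(100, 5))
    y_gold = (X_gold[:, 0] > 0).astype(int)
    X_unlabeled = rng.normal(size=(5000, 5))

    # Step 1: train an initial model on the gold standard data alone.
    base_model = LogisticRegression().fit(X_gold, y_gold)

    # Step 2: automatically label the unlabeled pool with the base model.
    y_auto = base_model.predict(X_unlabeled)

    # Step 3: retrain on the union of gold and automatically labeled data.
    X_combined = np.vstack([X_gold, X_unlabeled])
    y_combined = np.concatenate([y_gold, y_auto])
    self_trained_model = LogisticRegression().fit(X_combined, y_combined)

The abstract's observation is that models with a flexible parameterization can absorb the large, noisier automatically labeled set in step 3, whereas fixed-parameterization models gain little from it.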
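The third method's feature-rich log-linear lexical model can be sketched in the same spirit. The variant below defines a distribution over tags for a given word; its feature templates and toy weights are illustrative assumptions (the dissertation's exact parameterization may differ), intended only to show how overlapping features such as suffixes let the model score out-of-vocabulary words.

    import math
    from collections import defaultdict

    def features(word, tag):
        # Overlapping feature templates; illustrative assumptions only.
        return [
            f"word={word.lower()}|tag={tag}",      # lexical identity
            f"suffix={word[-3:]}|tag={tag}",       # also fires for OOV words
            f"cap={word[0].isupper()}|tag={tag}",  # word-shape information
        ]

    def lexical_prob(word, tag, all_tags, weights):
        # P(tag | word) proportional to exp(w . f(word, tag)), normalized
        # over the tag set: a simple log-linear lexical distribution.
        scores = {t: math.exp(sum(weights[f] for f in features(word, t)))
                  for t in all_tags}
        return scores[tag] / sum(scores.values())

    # Toy weights: a suffix feature alone can score an unseen word.
    weights = defaultdict(float)
    weights["suffix=ing|tag=VBG"] = 2.0
    print(lexical_prob("refactoring", "VBG", ["VBG", "NN"], weights))

Because the suffix and shape features are shared across the vocabulary, a word never seen in training still receives a sensible probability, which is the data-sparsity benefit the abstract attributes to this method.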

Bibliographic information

  • Author: Huang, Zhongqiang
  • Author affiliation: University of Maryland, College Park
  • Degree grantor: University of Maryland, College Park
  • Subject: Computer Science
  • Degree: Ph.D.
  • Year: 2011
  • Pagination: 227 p.
  • Total pages: 227
  • Format: PDF
  • Language: eng
