A non-parametric model for the discovery of inflectional paradigms from plain text using graphical models over strings.

Abstract

The field of statistical natural language processing has been turning toward morphologically rich languages. These languages have vocabularies that are often orders of magnitude larger than that of English, since words may be inflected in many different ways. This leads to problems with data sparseness and calls for models that can deal with this abundance of related words: models that can learn, analyze, reduce and generate morphological inflections. Surprisingly, statistical approaches to morphology are still rare, in contrast to the many recent advances of sophisticated models in parsing, grammar induction, translation and many other areas of natural language processing.

This thesis presents a novel, unified statistical approach to inflectional morphology, an approach that can decode and encode the inflectional system of a language. At the center of this approach stands the notion of inflectional paradigms. These paradigms cluster the large vocabulary of a language into structured chunks; inflections of the same word, like break, broke, breaks, breaking, ..., all belong in the same paradigm. Moreover, each of these inflections has an exact place within a paradigm, since each paradigm has designated slots for each possible inflection; for verbs, there is a slot for the first person singular indicative present, one for the third person plural subjunctive past, and slots for all other possible forms. The main goal of this thesis is to build probability models over inflectional paradigms, and thereby to sort the large vocabulary of a morphologically rich language into structured clusters. These models can be learned with minimal supervision for any language that has inflectional morphology. As training data, some sample paradigms and a raw, unannotated text corpus can be used.

The models over morphological paradigms are developed in three main chapters that start with smaller components and build up to larger ones.

The first of these chapters (Chapter 2) presents novel probability models over strings and string pairs. These are applicable to lemmatization, to relating a past-tense form to its associated present-tense form, and to similar morphological tasks. They turn out to be general enough to tackle the popular task of transliteration very well, as well as other string-to-string tasks.

The second (Chapter 3) introduces the notion of a probability model over multiple strings, which is a novel variant of Markov Random Fields. These models are used to relate the many inflections in an inflectional paradigm to one another, and they use the probability models from Chapter 2 as components. A novel version of belief propagation is presented, which propagates distributions over strings through a network of connected finite-state transducers, to perform inference in morphological paradigms (or other string fields).

Finally (Chapter 4), a non-parametric joint probability model over an unannotated text corpus and the morphological paradigms from Chapter 3 is presented. This model is based on a generative story for inflectional morphology that naturally incorporates common linguistic notions, such as lexemes, paradigms and inflections. Sampling algorithms are presented that perform inference over large text corpora and their implicit, hidden morphological paradigms. We show that they are able to discover the morphological paradigms that are implicit in the corpora. The model is based on finite-state operations and seamlessly handles concatenative and nonconcatenative morphology.
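
As a rough illustration of the paradigm notion described in the abstract, the sketch below stores the inflections of one lexeme in designated slots and indexes surface forms back to their paradigm. It is only a plain data-structure sketch, not the thesis's probabilistic model: the slot labels, the `Paradigm` class, and the `vocabulary_index` helper are hypothetical names chosen for this example, and the four-slot English verb inventory is an assumption made for brevity.

```python
from dataclasses import dataclass, field

# Hypothetical slot inventory for a tiny English verb paradigm (illustrative only).
SLOTS = ("base", "3sg_present", "past", "present_participle")


@dataclass
class Paradigm:
    """The inflections of one lexeme, each stored in a designated slot."""
    lexeme: str
    forms: dict = field(default_factory=dict)

    def fill(self, slot, form):
        # Reject labels outside the fixed slot inventory.
        if slot not in SLOTS:
            raise ValueError(f"unknown slot: {slot}")
        self.forms[slot] = form

    def missing_slots(self):
        # Slots whose form has not been observed yet.
        return [s for s in SLOTS if s not in self.forms]


def vocabulary_index(paradigms):
    """Map each surface form to the (lexeme, slot) pairs it can realize."""
    index = {}
    for p in paradigms:
        for slot, form in p.forms.items():
            index.setdefault(form, []).append((p.lexeme, slot))
    return index


# Partially observed paradigm for "break", using forms from the abstract.
brk = Paradigm("break")
brk.fill("base", "break")
brk.fill("3sg_present", "breaks")
brk.fill("past", "broke")
```

Here `brk.missing_slots()` reports the slot for "breaking" as still unfilled; in the setting the abstract describes, such gaps would be predicted by a probability model over strings instead of being left empty.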

Bibliographic record

  • Author

    Dreyer, Markus

  • Author affiliation

    The Johns Hopkins University

  • Degree-granting institution: The Johns Hopkins University
  • Subject: Language Linguistics; Computer Science
  • Degree: Ph.D.
  • Year: 2011
  • Pagination: 279 p.
  • Total pages: 279
  • Original format: PDF
  • Language of text: English
  • Date added to database: 2022-08-17 11:44:20
