Information Sciences: An International Journal

Modelling highly inflected languages



Abstract

Statistical language models encapsulate the varied information, both grammatical and semantic, present in a language. This paper investigates techniques for overcoming the difficulties of modelling highly inflected languages, whose main problem is the very large number of distinct word forms. We propose to model the grammatical and semantic information of words separately by splitting them into stems and endings. All the information is handled within a data-driven formalism. Grammatical information is well modelled by short-term dependencies. This article is primarily concerned with modelling the semantic information diffused through the entire text. It is presumed that the language being modelled is homogeneous in topic. The training corpus, which is very topically heterogeneous, is divided into three semantic levels based on topic similarity with the target-environment text. Text on each semantic level is used as training text for one component of a mixture model. A document is defined as a basic unit of the training corpus that is semantically homogeneous. The topic similarity between a document and a collection of target-environment texts is determined by the cosine vector similarity function and the TFIDF weighting heuristic. The crucial question in the case of highly inflected languages is how to define terms. Terms are defined as clusters of words, and clustering is based on approximate string matching. We experimented with the Levenshtein distance and the Ratcliff/Obershelp similarity measure, both in combination with ending-stripping. Experiments on the Slovenian language were performed on a corpus of VECER newswire text. The results show a significant reduction in OOV rate and perplexity. (C) 2003 Elsevier Inc. All rights reserved.
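The two mechanisms named in the abstract, TFIDF-weighted cosine similarity for topic matching and approximate string matching for term clustering, can be sketched in a few lines of Python. This is a toy illustration only: the ending list, the similarity threshold, and the greedy single-pass clustering are illustrative assumptions, not the paper's actual procedure; `difflib.SequenceMatcher.ratio()` is the standard-library implementation of the Ratcliff/Obershelp measure.

```python
import math
from difflib import SequenceMatcher  # ratio() implements Ratcliff/Obershelp

# --- Topic similarity: cosine over TF-IDF term-weight vectors ---
def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per tokenized doc."""
    n = len(docs)
    df = {}  # document frequency of each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vecs = []
    for doc in docs:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        vecs.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# --- Term clustering: ending-stripping + Ratcliff/Obershelp matching ---
ENDINGS = ("ami", "ov", "om", "a", "e", "i", "o", "u")  # illustrative, not the paper's list

def strip_ending(word):
    """Strip the longest matching ending, keeping a stem of at least 3 chars."""
    for suf in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def cluster_words(words, threshold=0.8):
    """Greedy single-pass clustering: a word joins the first cluster whose
    representative stem is similar enough under Ratcliff/Obershelp."""
    clusters = []  # list of (representative_stem, member_words)
    for w in words:
        stem = strip_ending(w)
        for rep, members in clusters:
            if SequenceMatcher(None, stem, rep).ratio() >= threshold:
                members.append(w)
                break
        else:
            clusters.append((stem, [w]))
    return [members for _, members in clusters]
```

For example, the inflected forms "miza", "mize", "mizo" all strip to the stem "miz" and fall into one cluster (one term), separate from "knjiga"/"knjige", so the effective vocabulary for TFIDF weighting shrinks accordingly.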
