Information Sciences: An International Journal

Modelling highly inflected languages



Abstract

Statistical language models encapsulate the varied information, both grammatical and semantic, present in a language. This paper investigates techniques for overcoming the difficulties of modelling highly inflected languages, whose main problem is the very large number of distinct word forms. We propose to model the grammatical and semantic information of words separately by splitting them into stems and endings. All the information is handled within a data-driven formalism. Grammatical information is well modelled by short-term dependencies. This article is primarily concerned with modelling the semantic information diffused through the entire text. It is presumed that the language being modelled is homogeneous in topic. The training corpus, which is very topically heterogeneous, is divided into three semantic levels based on topic similarity with the target-environment text. Text on each semantic level is used as training text for one component of a mixture model. A document is defined as a basic unit of the training corpus that is semantically homogeneous. The topic similarity between a document and a collection of target-environment texts is determined by the cosine vector similarity function and the TFIDF weighting heuristic. The crucial question in the case of highly inflected languages is how to define terms. Terms are defined as clusters of words, and clustering is based on approximate string matching. We experimented with the Levenshtein distance and the Ratcliff/Obershelp similarity measure, both in combination with ending-stripping. Experiments on the Slovenian language were performed on a corpus of VECER newswire text. The results show a significant reduction in OOV rate and perplexity. (C) 2003 Elsevier Inc. All rights reserved.
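The two mechanisms named in the abstract, TFIDF-weighted cosine similarity for topic matching and approximate string matching for term clustering, can be sketched in a few lines of Python. This is a toy illustration only: the ending list, the similarity threshold, and the greedy single-pass clustering are illustrative assumptions, not the paper's actual procedure; `difflib.SequenceMatcher.ratio()` is the standard-library implementation of the Ratcliff/Obershelp measure.

```python
import math
from difflib import SequenceMatcher  # ratio() implements Ratcliff/Obershelp

# --- Topic similarity: cosine over TF-IDF term-weight vectors ---
def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per tokenized doc."""
    n = len(docs)
    df = {}  # document frequency of each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vecs = []
    for doc in docs:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        vecs.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# --- Term clustering: ending-stripping + Ratcliff/Obershelp matching ---
ENDINGS = ("ami", "ov", "om", "a", "e", "i", "o", "u")  # illustrative, not the paper's list

def strip_ending(word):
    """Strip the longest matching ending, keeping a stem of at least 3 chars."""
    for suf in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def cluster_words(words, threshold=0.8):
    """Greedy single-pass clustering: a word joins the first cluster whose
    representative stem is similar enough under Ratcliff/Obershelp."""
    clusters = []  # list of (representative_stem, member_words)
    for w in words:
        stem = strip_ending(w)
        for rep, members in clusters:
            if SequenceMatcher(None, stem, rep).ratio() >= threshold:
                members.append(w)
                break
        else:
            clusters.append((stem, [w]))
    return [members for _, members in clusters]
```

For example, the inflected forms "miza", "mize", "mizo" all strip to the stem "miz" and fall into one cluster (one term), separate from "knjiga"/"knjige", so the effective vocabulary for TFIDF weighting shrinks accordingly.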
