Home > Foreign-language journals > Engineering Economics > Evaluation of Language Models over Croatian Newspaper Texts

Evaluation of Language Models over Croatian Newspaper Texts


Abstract

Statistical language modeling comprises techniques and procedures that assign probabilities to word sequences or, in other words, estimate the regularity of a language. This paper presents the basic characteristics of statistical language models, reviews their use in a wide range of speech and language applications, explains their formal definition, and describes the different types of language models. A detailed overview of n-gram and class-based models (and their combinations) is given chronologically, by type and complexity of the models, and with respect to their use in NLP applications for different natural languages. The proposed experimental procedure compares three types of statistical language models: word-based n-gram models, categorical models based on automatically determined categories, and categorical models based on POS tags. We propose a language model for contemporary Croatian texts, together with a procedure for determining the best n-gram order and the optimal number of categories, which leads to a significant decrease in the perplexity of language models estimated from the Croatian News Agency (HINA) article corpus. Using different language models estimated from the HINA corpus, we show experimentally that category-based models describe natural language better than word-based models. The findings of the proposed experiment apply not only to Croatian but also to similar highly inflectional languages with rich morphology and non-mandatory word order.

DOI: http://dx.doi.org/10.5755/j01.itc.46.4.18367
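To make the notions of "assigning probabilities to word sequences" and "perplexity" concrete, the following is a minimal sketch of a word-based bigram model. It is not the authors' procedure: the add-one smoothing, the function names, and the toy Croatian sentences are all assumptions chosen for illustration, and nothing here comes from the HINA corpus.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences.

    Each sentence is padded with <s> and </s> boundary markers.
    """
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """Add-one (Laplace) smoothed estimate of P(word | prev).

    Assumption: the vocabulary is the set of training tokens. Words
    unseen in training are not modeled separately, which a real system
    would handle with an <unk> token or a better smoothing scheme.
    """
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def perplexity(sentences, unigrams, bigrams):
    """Perplexity = 2 ** (-average log2 probability per predicted token)."""
    log_prob, n_tokens = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            log_prob += math.log2(bigram_prob(prev, word, unigrams, bigrams))
            n_tokens += 1
    return 2 ** (-log_prob / n_tokens)

# Hypothetical toy training data, for illustration only.
train = [
    ["vlada", "je", "donijela", "odluku"],
    ["vlada", "je", "objavila", "odluku"],
]
unigrams, bigrams = train_bigram(train)

seen_order = [["vlada", "je", "donijela", "odluku"]]
scrambled = [["odluku", "donijela", "je", "vlada"]]
pp_seen = perplexity(seen_order, unigrams, bigrams)
pp_scrambled = perplexity(scrambled, unigrams, bigrams)
```

Lower perplexity means the model finds the text more predictable, so `pp_seen` comes out below `pp_scrambled`. A class-based model of the kind compared in the paper would replace the word history with a category history, which reduces the number of parameters to estimate; that variant is not shown here.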
