Evaluation of Language Models over Croatian Newspaper Texts

Beliga Slobodan; Ipšić Ivo; Martinčić-Ipšić Sanda

Abstract

Statistical language modeling involves techniques and procedures that assign probabilities to word sequences or, in other words, estimate the regularity of a language. This paper presents the basic characteristics of statistical language models, reviews their use in a wide range of speech and language applications, explains their formal definition, and shows different types of language models. A detailed overview of n-gram and class-based models (as well as their combinations) is given chronologically, by type and complexity of the models, and with respect to their use in different NLP applications for different natural languages. The proposed experimental procedure compares three types of statistical language models: n-gram models based on words, categorical models based on automatically determined categories, and categorical models based on POS tags. In the paper, we propose a language model for contemporary Croatian texts and a procedure for determining the best n-gram and the optimal number of categories, which leads to a significant decrease in language model perplexity, estimated from the Croatian News Agency (HINA) articles corpus. Using different language models estimated from the HINA corpus, we show experimentally that models based on categories describe the natural language better than those based on words. The findings of the proposed experiment are applicable not only to Croatian but also to similar highly inflectional languages with rich morphology and non-mandatory word order.

DOI: http://dx.doi.org/10.5755/j01.itc.46.4.18367
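To make the notions of an n-gram model and perplexity concrete, the following is a minimal sketch of a word-based bigram model evaluated by perplexity. It is an illustration under stated assumptions, not the authors' procedure: the toy sentences stand in for a real corpus (they are not from HINA), add-one (Laplace) smoothing is chosen only for simplicity, and the function names `train_bigram`, `bigram_prob`, and `perplexity` are hypothetical.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams, vocab_size):
    """Add-one smoothed estimate of P(word | prev) (smoothing choice is an assumption)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def perplexity(sentences, unigrams, bigrams):
    """Perplexity = 2 ** (negative average log2 probability per predicted token)."""
    vocab_size = len(unigrams)
    log_prob, n_tokens = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            log_prob += math.log2(bigram_prob(prev, word, unigrams, bigrams, vocab_size))
            n_tokens += 1
    return 2 ** (-log_prob / n_tokens)

# Toy data for illustration only; all test words are assumed to be in the vocabulary.
train_sents = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
heldout_sents = [["dog", "the", "ran"]]

unigrams, bigrams = train_bigram(train_sents)
print("train perplexity:   ", perplexity(train_sents, unigrams, bigrams))
print("held-out perplexity:", perplexity(heldout_sents, unigrams, bigrams))
```

Lower perplexity means the model assigns higher probability to the evaluated text. A category-based variant in the spirit of the abstract would estimate the same kind of counts over category labels (automatically induced classes or POS tags) rather than over word forms.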


Bibliographic record

Source
Engineering Economics, 2017, Issue 4, 20 pages
Authors
Beliga Slobodan; Ipšić Ivo; Martinčić-Ipšić Sanda
Original format: PDF
CLC classification: Industrial economics

