Learning to lemmatize Slovene words


Abstract

Automatic lemmatisation is a core application for many language processing tasks. In inflectionally rich languages, such as Slovene, assigning the correct lemma to each word in a running text is not trivial: nouns and adjectives, for instance, inflect for number and case, with a complex configuration of endings and stem modifications. The problem is especially difficult for unknown words, as word forms cannot be matched against a lexicon giving the correct lemma, its part-of-speech and paradigm class. The paper discusses a machine learning approach to the automatic lemmatisation of unknown words, in particular nouns and adjectives, in Slovene texts. We decompose the problem of learning to perform lemmatisation into two subproblems: the first is to learn to perform morphosyntactic tagging, and the second is to learn to perform morphological analysis, which produces the lemma from the word form given the correct morphosyntactic tag. A statistics-based trigram tagger is used to learn to perform morphosyntactic tagging and a first-order decision list learning system is used to learn rules for morphological analysis. The dataset used is the 90,000-word Slovene translation of Orwell's '1984', split into a training and validation set. The validation set is the Appendix of the novel, on which extensive testing of the two components, singly and in combination, is performed. The trained model is then used on an open-domain testing set, which has 25,000 words, pre-annotated with their word lemmas. Here 13,000 noun or adjective tokens are previously unseen cases. Tested on these unknown words, our method achieves an accuracy of 81% on the lemmatisation task.
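
The two-stage decomposition described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' system: the rules below are hand-written rather than learned, the tag strings are illustrative, the names `RULES`, `analyse`, `lemmatise` and `toy_tagger` are hypothetical, and a lookup table stands in for the trigram tagger. It only shows how a tag predicted in the first stage selects an ordered list of suffix rules in the second stage, where the first matching rule produces the lemma.

```python
from typing import Callable, Dict, List, Tuple

# A rule is (suffix to strip, suffix to append). Within one tag the rules are
# tried in order and the first matching rule fires (decision-list behaviour).
Rule = Tuple[str, str]

# Hypothetical hand-written rules keyed by an illustrative morphosyntactic tag.
# A real system would learn such rules from annotated data.
RULES: Dict[str, List[Rule]] = {
    "Ncfsg": [("e", "a")],   # feminine noun, genitive singular: mize -> miza
    "Ncfsa": [("o", "a")],   # feminine noun, accusative singular: mizo -> miza
    "Ncnsg": [("a", "o")],   # neuter noun, genitive singular: mesta -> mesto
    "Afpfsn": [("a", "")],   # adjective, feminine singular nominative: lepa -> lep
}


def analyse(word_form: str, tag: str) -> str:
    """Stage 2: produce a lemma from a word form and its (assumed correct) tag."""
    for strip, append in RULES.get(tag, []):
        if word_form.endswith(strip):
            return word_form[: len(word_form) - len(strip)] + append
    # Default rule: no rule matched, so return the word form unchanged.
    return word_form


def toy_tagger(tokens: List[str]) -> List[str]:
    """Stage 1 stand-in: a lookup table in place of a trained trigram tagger."""
    lookup = {"mize": "Ncfsg", "mizo": "Ncfsa", "mesta": "Ncnsg", "lepa": "Afpfsn"}
    return [lookup.get(token, "X") for token in tokens]


def lemmatise(tokens: List[str], tagger: Callable[[List[str]], List[str]]) -> List[str]:
    """Run tagging first, then morphological analysis on each (word, tag) pair."""
    tags = tagger(tokens)
    return [analyse(word.lower(), tag) for word, tag in zip(tokens, tags)]


if __name__ == "__main__":
    print(lemmatise(["lepa", "mizo", "mesta"], toy_tagger))
```

Because the second stage depends on the tag, a tagging error can select the wrong rule list, which is one reason to evaluate the two components both singly and in combination.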


Bibliographic details

Source: Learning Language in Logic (LLL) Workshop, 2000, 20 pages
Authors: Saso Dzeroski; Tomaz Erjavec
Original format: PDF
Chinese Library Classification: TP18-53

