Conference: Learning Language in Logic (LLL) Workshop
Learning to lemmatize Slovene words

Abstract

Automatic lemmatisation is a core component of many language processing tasks. In inflectionally rich languages such as Slovene, assigning the correct lemma to each word in running text is not trivial: nouns and adjectives, for instance, inflect for number and case, with a complex configuration of endings and stem modifications. The problem is especially difficult for unknown words, because their word forms cannot be matched against a lexicon that gives the correct lemma, part of speech and paradigm class. The paper discusses a machine learning approach to the automatic lemmatisation of unknown words, in particular nouns and adjectives, in Slovene texts. We decompose the problem of learning to lemmatise into two subproblems: the first is learning to perform morphosyntactic tagging, and the second is learning to perform morphological analysis, which produces the lemma from the word form given the correct morphosyntactic tag. A statistical trigram tagger is used to learn morphosyntactic tagging, and a first-order decision list learning system is used to learn rules for morphological analysis. The dataset is the 90,000-word Slovene translation of Orwell's '1984', split into a training set and a validation set. The validation set is the Appendix of the novel, on which the two components are tested extensively, singly and in combination. The trained model is then applied to an open-domain test set of 25,000 words, pre-annotated with their lemmas. Of these, 13,000 noun or adjective tokens are previously unseen cases. Tested on these unknown words, our method achieves an accuracy of 81% on the lemmatisation task.
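
The two-stage decomposition described in the abstract (tag first, then derive the lemma from the word form and its tag) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the tag strings, the three suffix rules, the example word forms, and the `toy_tagger` lookup that stands in for the trained trigram tagger. The paper's learned first-order decision lists are not reproduced; the sketch only shows the general shape of an ordered rule list in which the first matching rule determines the lemma.

```python
from typing import Callable, List, NamedTuple, Optional


class Rule(NamedTuple):
    tag: str     # morphosyntactic tag the rule applies to (hypothetical tag strings)
    suffix: str  # ending to strip from the word form
    ending: str  # ending to append to obtain the lemma


# Hypothetical decision list: rules are tried in order and the first match wins.
RULES = [
    Rule("Ncfsg", "e", "a"),  # illustrative: mize -> miza
    Rule("Ncnsg", "a", "o"),  # illustrative: mesta -> mesto
    Rule("Ncfsn", "", ""),    # nominative singular form is already the lemma
]


def analyse(form: str, tag: str, rules: List[Rule] = RULES) -> Optional[str]:
    """Stage 2: produce the lemma from a word form and its tag, or None if no rule fires."""
    for rule in rules:
        if rule.tag == tag and form.endswith(rule.suffix):
            stem = form[: len(form) - len(rule.suffix)]
            return stem + rule.ending
    return None


def toy_tagger(tokens: List[str]) -> List[str]:
    """Stage 1 stand-in: a fixed lookup replacing the statistical trigram tagger."""
    lookup = {"mize": "Ncfsg", "mesta": "Ncnsg", "miza": "Ncfsn"}
    return [lookup.get(token, "X") for token in tokens]


def lemmatise(tokens: List[str],
              tagger: Callable[[List[str]], List[str]] = toy_tagger,
              rules: List[Rule] = RULES) -> List[str]:
    """Tag the tokens, then analyse each one; fall back to the form itself if no rule fires."""
    tags = tagger(tokens)
    return [analyse(form, tag, rules) or form for form, tag in zip(tokens, tags)]


if __name__ == "__main__":
    print(lemmatise(["mize", "mesta", "miza"]))
```

Because the second stage is conditioned on the tag, a tagging error in the first stage can select the wrong rule, which is one reason the abstract reports testing the two components both singly and in combination.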
