
An Extensible Multilingual Open Source Lemmatizer


Abstract

We present GATE DictLemmatizer, a multilingual open-source lemmatizer for the GATE NLP framework. It currently supports English, German, Italian, French, Dutch, and Spanish, and is easily extensible to other languages. The software is freely available under the LGPL license. Lemmatization is based on the Helsinki Finite-State Transducer Technology (HFST) and on lemma dictionaries created automatically from Wiktionary. We evaluate the lemmatizers against TreeTagger, which is freely available only for research purposes. Our evaluation shows that DictLemmatizer achieves results similar to, or even better than, those of TreeTagger for languages that HFST supports. Performance drops when HFST support is absent and the entire lemmatization process relies on lemma dictionaries. However, the results are still satisfactory given that DictLemmatizer is open source and can be easily extended to other languages. The software for extending the lemmatizer by creating word lists from Wiktionary dictionaries is also freely available as open-source software.
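To make the dictionary-plus-analyzer idea concrete, the following is a minimal sketch and not the DictLemmatizer implementation. It assumes a lookup keyed by word form and part of speech, tried before an optional analyzer callback. The lookup order, the names `LEMMA_DICT`, `lemmatize`, and `toy_analyzer`, and the toy entries are all hypothetical; the analyzer is a trivial stand-in and does not call HFST.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical toy dictionary: (lowercased word form, coarse POS tag) -> lemma.
# A real dictionary would be generated from Wiktionary data, as the abstract describes.
LEMMA_DICT: Dict[Tuple[str, str], str] = {
    ("mice", "NOUN"): "mouse",
    ("went", "VERB"): "go",
    ("better", "ADJ"): "good",
}

Analyzer = Callable[[str, str], Optional[str]]


def lemmatize(
    word: str,
    pos: str,
    dictionary: Dict[Tuple[str, str], str],
    analyzer: Optional[Analyzer] = None,
) -> str:
    """Return a lemma for `word`, falling back to the word itself."""
    key = (word.lower(), pos)
    # Assumption: the dictionary is consulted first.
    if key in dictionary:
        return dictionary[key]
    # Assumption: an optional morphological analyzer is tried next.
    if analyzer is not None:
        candidate = analyzer(word, pos)
        if candidate:
            return candidate
    # No information available: leave the token unchanged.
    return word


def toy_analyzer(word: str, pos: str) -> Optional[str]:
    """Stand-in for a morphological analyzer; it only strips a plural 's'."""
    lowered = word.lower()
    if pos == "NOUN" and lowered.endswith("s") and len(lowered) > 1:
        return lowered[:-1]
    return None


if __name__ == "__main__":
    print(lemmatize("Mice", "NOUN", LEMMA_DICT))                # dictionary hit
    print(lemmatize("cats", "NOUN", LEMMA_DICT, toy_analyzer))  # analyzer fallback
    print(lemmatize("cats", "NOUN", LEMMA_DICT))                # dictionary only
```

With no analyzer supplied, any form missing from the dictionary is returned unchanged, which is one plausible reading of why dictionary-only languages score lower in the evaluation.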
