Multilingual Tokenization and Part-of-speech Tagging. Lightweight Versus Heavyweight Algorithms

Abstract

This work focuses on morphological analysis of raw text and provides a recipe for tokenization, sentence splitting and part-of-speech tagging for all languages included in the Universal Dependencies Corpus. Scalability is an important issue when dealing with large multilingual corpora. The experiments include both lightweight classifiers (linear models and decision trees) and heavyweight LSTM-based architectures that are able to attain state-of-the-art results. All experiments are carried out using the provided data "as-is". We apply lightweight and heavyweight classifiers to 5 distinct tasks across multiple languages; we present some lessons learned during the training process; we look at per-language results as well as task averages; we report model footprints; and finally we draw a few conclusions regarding the trade-offs between the classifiers' characteristics.
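The abstract contrasts lightweight classifiers (linear models and decision trees) with heavyweight LSTM-based architectures for per-token tasks such as part-of-speech tagging over Universal Dependencies treebanks. As a rough illustration of the lightweight end of that spectrum, the sketch below trains a decision-tree UPOS tagger on hand-crafted surface features read from CoNLL-U files. This is a minimal sketch under stated assumptions, not the paper's recipe: the file paths, the feature set, and the use of scikit-learn are illustrative choices, not details taken from the paper.

```python
# Minimal sketch: a "lightweight" per-token UPOS tagger (decision tree over
# simple surface features) trained on CoNLL-U data. Paths and features are
# illustrative assumptions, not the paper's actual setup.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


def read_conllu(path):
    """Yield (tokens, upos_tags) pairs from a CoNLL-U file."""
    tokens, tags = [], []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line:
                if tokens:
                    yield tokens, tags
                    tokens, tags = [], []
                continue
            if line.startswith("#"):
                continue
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:  # skip multiword/empty tokens
                continue
            tokens.append(cols[1])  # FORM column
            tags.append(cols[3])    # UPOS column
    if tokens:
        yield tokens, tags


def token_features(tokens, i):
    """Simple surface features for the token at position i."""
    w = tokens[i]
    return {
        "word": w.lower(),
        "suffix3": w[-3:].lower(),
        "prefix2": w[:2].lower(),
        "is_title": w.istitle(),
        "is_digit": w.isdigit(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }


def build_dataset(path):
    X, y = [], []
    for tokens, tags in read_conllu(path):
        for i, tag in enumerate(tags):
            X.append(token_features(tokens, i))
            y.append(tag)
    return X, y


if __name__ == "__main__":
    # Hypothetical UD treebank paths; substitute any UD .conllu files.
    X_train, y_train = build_dataset("ud-train.conllu")
    X_dev, y_dev = build_dataset("ud-dev.conllu")

    tagger = Pipeline([
        ("vec", DictVectorizer(sparse=True)),
        ("clf", DecisionTreeClassifier(max_depth=40)),
    ])
    tagger.fit(X_train, y_train)
    print("dev accuracy:", tagger.score(X_dev, y_dev))
```

A comparable heavyweight baseline would replace the feature dictionary and decision tree with a word- or character-level LSTM sequence model; the difference in model footprint and training cost between the two is the kind of trade-off the paper examines.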
