Conference: Nordic Conference of Computational Linguistics

Toward Multilingual Identification of Online Registers

Abstract

We consider cross- and multilingual text classification approaches to the identification of online registers (genres), i.e. text varieties with specific situational characteristics. Register is arguably the most important predictor of linguistic variation, and register information could improve the potential of online data for many applications. We introduce the Finnish Corpus of Online REgisters (FinCORE), the first manually annotated non-English corpus of online registers featuring the full range of linguistic variation found online. The data set consists of 2,237 Finnish documents and follows the register taxonomy developed for the Corpus of Online Registers of English (CORE), the largest manually annotated language collection of online registers. Using CORE and FinCORE data, we demonstrate the feasibility of cross-lingual register identification using a simple approach based on convolutional neural networks and multilingual word embeddings. We further find that register identification results can be improved through multilingual training even when a substantial number of annotations is available in the target language.
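
As a rough illustration of the general idea behind a convolutional classifier over multilingual word embeddings (not the authors' implementation), the sketch below runs a tiny text CNN over a shared embedding space. Everything in it is an assumption made for illustration: the three-dimensional toy vectors in `EMB`, the two labels in `LABELS`, and the hand-set (untrained) filter and output weights. The point it shows is that when translations sit close together in the shared space, the same filters respond to English and Finnish input alike.

```python
import math

# Hypothetical toy embeddings in a shared space: translation pairs are
# placed near each other by hand. Real multilingual embeddings are learned.
EMB = {
    "buy": [1.0, 0.0, 0.1], "osta": [0.9, 0.1, 0.1],
    "now": [0.8, 0.1, 0.0], "nyt": [0.8, 0.0, 0.1],
    "i": [0.0, 1.0, 0.0], "minä": [0.1, 0.9, 0.0],
    "think": [0.1, 1.0, 0.2], "luulen": [0.0, 0.9, 0.2],
}
UNK = [0.0, 0.0, 0.0]           # vector for out-of-vocabulary tokens
LABELS = ["sales", "opinion"]   # illustrative labels, not the CORE taxonomy

# Two hand-set filters of width 2 (assumed, not trained).
FILTERS = [
    [[1.0, -1.0, 0.0], [1.0, -1.0, 0.0]],   # responds to dimension 0
    [[-1.0, 1.0, 0.0], [-1.0, 1.0, 0.0]],   # responds to dimension 1
]
BIASES = [0.0, 0.0]
W_OUT = [[2.0, 0.0], [0.0, 2.0]]            # one output row per label


def embed(tokens):
    """Look up each lowercased token in the shared embedding table."""
    return [EMB.get(t.lower(), UNK) for t in tokens]


def conv1d_relu(seq, filt, bias):
    """Slide one filter over the vector sequence and apply ReLU."""
    width = len(filt)
    out = []
    for i in range(len(seq) - width + 1):
        s = bias
        for k in range(width):
            s += sum(w * x for w, x in zip(filt[k], seq[i + k]))
        out.append(max(0.0, s))
    return out


def max_pool(values):
    """Max-over-time pooling; 0.0 if the input is shorter than the filter."""
    return max(values) if values else 0.0


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(tokens):
    """Return (label, probabilities) for a tokenized document."""
    seq = embed(tokens)
    feats = [max_pool(conv1d_relu(seq, f, b)) for f, b in zip(FILTERS, BIASES)]
    scores = [sum(w * x for w, x in zip(row, feats)) for row in W_OUT]
    probs = softmax(scores)
    best = max(range(len(LABELS)), key=lambda i: probs[i])
    return LABELS[best], probs


if __name__ == "__main__":
    for doc in (["buy", "now"], ["osta", "nyt"], ["I", "think"], ["minä", "luulen"]):
        print(doc, predict(doc)[0])
```

In a realistic setting the filters and output layer would be trained on annotated documents in one language (or several) and the embeddings would come from a pretrained multilingual model; the hand-set values here only make the cross-lingual sharing visible.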
