Conference: Nordic Conference of Computational Linguistics

Toward Multilingual Identification of Online Registers


Abstract

We consider cross- and multilingual text classification approaches to the identification of online registers (genres), i.e. text varieties with specific situational characteristics. Register is arguably the most important predictor of linguistic variation, and register information could improve the potential of online data for many applications. We introduce the Finnish Corpus of Online REgisters (FinCORE), the first manually annotated non-English corpus of online registers featuring the full range of linguistic variation found online. The data set consists of 2,237 Finnish documents and follows the register taxonomy developed for the Corpus of Online Registers of English (CORE), the largest manually annotated language collection of online registers. Using CORE and FinCORE data, we demonstrate the feasibility of cross-lingual register identification using a simple approach based on convolutional neural networks and multilingual word embeddings. We further find that register identification results can be improved through multilingual training even when a substantial number of annotations is available in the target language.
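To illustrate why a convolutional classifier over multilingual word embeddings can transfer across languages, here is a minimal sketch. It is not the authors' implementation. It assumes a shared embedding space in which translation equivalents have similar vectors. The embedding values, the filter weights, and the two labels (`sales`, `news`) are hypothetical toy values rather than CORE categories, and the filters are set by hand where a real model would learn them from annotated data.

```python
# Minimal sketch, not the paper's model: one convolution layer with
# max-over-time pooling on top of a shared multilingual embedding space.
# All vectors, weights, and labels below are hypothetical toy values.

# Hypothetical aligned embeddings: translation pairs share the same vector.
EMBEDDINGS = {
    "en": {"buy": [1.0, 0.0], "now": [0.9, 0.1],
           "news": [0.0, 1.0], "today": [0.1, 0.9]},
    "fi": {"osta": [1.0, 0.0], "nyt": [0.9, 0.1],
           "uutiset": [0.0, 1.0], "tänään": [0.1, 0.9]},
}

# Hand-set width-2 filters (one weight vector per window position).
# In a trained model these would be learned from source-language data.
FILTERS = {
    "sales": [[1.0, -1.0], [1.0, -1.0]],
    "news": [[-1.0, 1.0], [-1.0, 1.0]],
}


def embed(tokens, lang):
    """Look up each token in the language's table; skip unknown words."""
    table = EMBEDDINGS[lang]
    return [table[t] for t in tokens if t in table]


def conv_max_pool(vectors, filt, bias=0.0):
    """Slide one filter over the sequence, apply ReLU, keep the maximum."""
    width = len(filt)
    best = 0.0  # starting at 0.0 makes the running max act as ReLU
    for start in range(len(vectors) - width + 1):
        window = vectors[start:start + width]
        activation = bias + sum(
            w * x
            for weights, vec in zip(filt, window)
            for w, x in zip(weights, vec)
        )
        best = max(best, activation)
    return best


def classify(tokens, lang):
    """Pick the label whose filter responds most strongly to the text."""
    vectors = embed(tokens, lang)
    scores = {label: conv_max_pool(vectors, f) for label, f in FILTERS.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(classify(["buy", "now"], "en"))
    print(classify(["osta", "nyt"], "fi"))
    print(classify(["uutiset", "tänään"], "fi"))
```

Because the filters only see vectors, not words, the same weights apply unchanged to Finnish input once both languages are mapped into one space. This is the property that cross-lingual transfer relies on in this kind of setup.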
