Workshop on Domain Adaptation for NLP

Challenges in Annotating and Parsing Spoken, Code-switched, Frisian-Dutch Data

Abstract

While high performance has been obtained for dependency parsing of high-resource languages, performance for low-resource languages lags behind. In this paper we focus on parsing the low-resource language Frisian. We use a sample of code-switched, spontaneously spoken data, which proves to be a challenging setup. We propose to train a parser specifically tailored towards the target domain by selecting instances from multiple treebanks. Specifically, we use Latent Dirichlet Allocation (LDA) with word and character n-gram features. The best single source treebank (NL_ALPINO) resulted in a labeled attachment score (LAS) of 54.7, whereas our data selection outperformed it and led to 55.6 LAS on the test data. Additional experiments consisted of removing diacritics from our Frisian data, creating more similar training data by cropping sentences, and running our best model using XLM-R. These experiments did not lead to better performance.
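The abstract does not say how the LDA output is turned into a selection, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each sentence is represented as a bag of word and character n-grams, that LDA is fitted on the pooled source and target sentences, and that source sentences are ranked by the Euclidean distance between their topic distribution and the mean target distribution. The function names, the n-gram orders, the hyperparameters, and the toy sentences are all hypothetical.

```python
import math
import random
from collections import Counter


def ngram_features(sentence, max_word_n=2, char_ns=(2, 3)):
    """Turn a sentence into a bag of word n-grams and character n-grams."""
    words = sentence.lower().split()
    feats = []
    for n in range(1, max_word_n + 1):
        feats += ["w:" + " ".join(words[i:i + n])
                  for i in range(len(words) - n + 1)]
    text = " ".join(words)
    for n in char_ns:
        feats += ["c:" + text[i:i + n] for i in range(len(text) - n + 1)]
    return feats


def lda_gibbs(docs, n_topics=2, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns one topic distribution per doc."""
    rng = random.Random(seed)
    vocab_size = len({f for doc in docs for f in doc})
    doc_topic = [[0] * n_topics for _ in docs]        # topic counts per document
    topic_feat = [Counter() for _ in range(n_topics)]  # feature counts per topic
    topic_total = [0] * n_topics                       # total features per topic
    assign = []
    # Random initial topic assignment for every feature occurrence.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(n_topics) for _ in doc]
        assign.append(zs)
        for f, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_feat[z][f] += 1
            topic_total[z] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, f in enumerate(doc):
                # Remove the current assignment, then resample it.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_feat[z][f] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_feat[k][f] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_feat[z][f] += 1
                topic_total[z] += 1
    # Smoothed per-document topic proportions (each row sums to 1).
    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]


def select_instances(pool, target, k, **lda_args):
    """Return the k pool sentences whose topic mix is closest to the target's."""
    theta = lda_gibbs([ngram_features(s) for s in pool + target], **lda_args)
    pool_theta, target_theta = theta[:len(pool)], theta[len(pool):]
    centroid = [sum(col) / len(target_theta) for col in zip(*target_theta)]

    def distance(t):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(t, centroid)))

    ranked = sorted(range(len(pool)), key=lambda i: distance(pool_theta[i]))
    return [pool[i] for i in ranked[:k]]


# Toy, made-up sentences standing in for source treebanks and target data.
pool = [
    "de kat zit op de mat",
    "the cat sits on the mat",
    "ik heb een boek gelezen",
    "we have read a book",
]
target = ["ik haw in boek lêzen"]
selected = select_instances(pool, target, k=2, n_topics=2, n_iter=50, seed=0)
print(selected)
```

The sampler is written in pure Python so that the sketch stays self-contained; on real treebanks a library implementation such as scikit-learn's `LatentDirichletAllocation` would be the more practical choice, and the distance measure and number of topics would need tuning on development data.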