Workshop on Insights from Negative Results in NLP

Domain adaptation challenges of BERT in tokenization and sub-word representations of Out-of-Vocabulary words

Abstract

The BERT model (Devlin et al., 2019) has achieved significant progress on several Natural Language Processing (NLP) tasks by leveraging the multi-head self-attention mechanism (Vaswani et al., 2017) in its architecture. However, several research challenges remain that are not tackled well for the domain-specific corpora found in industry. In this paper, we highlight these problems through detailed experiments involving analysis of attention scores and dynamic word embeddings with the BERT-Base-Uncased model. Our experiments have led to interesting findings that showed: 1) the largest substring from the left that is found in the vocabulary (in-vocab) is always chosen at every sub-word unit, which can lead to suboptimal tokenization choices; 2) the semantic meaning of a vocabulary word deteriorates when it is found as a substring in an Out-of-Vocabulary (OOV) word; and 3) minor misspellings in words are inadequately handled. We believe that tackling these challenges would significantly help the domain adaptation aspect of BERT.
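
To make finding 1 concrete, here is a minimal sketch of the greedy longest-match-from-left procedure that WordPiece-style tokenizers use; the toy vocabulary is a hypothetical illustration chosen to expose the failure mode, not the released BERT-Base-Uncased vocabulary.

```python
# Minimal sketch of WordPiece-style greedy longest-match-from-left
# tokenization (finding 1). The toy vocabulary below is hypothetical;
# it is not the real BERT vocabulary.

def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """At each position, greedily take the longest in-vocab substring;
    continuation pieces are prefixed with '##' as in BERT."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # As in BERT's WordPiece, an unmatchable remainder maps
            # the whole word to the unknown token.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

# The greedy longest prefix "telep" strands the remainder "hone",
# collapsing the whole word to [UNK], even though the alternative
# split "tele" + "##phone" would have succeeded.
vocab = {"tele", "telep", "##phone"}
print(wordpiece_tokenize("telephone", vocab))  # -> ['[UNK]']
```

Because the match is committed greedily at each step and never revisited, the choice of "telep" is final, which is exactly the suboptimal tokenization behavior described above.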
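Findings 2 and 3 can be probed directly with the released tokenizer. The sketch below assumes the HuggingFace transformers library is installed; the example words are our own illustrative choices, and the exact sub-word splits depend on the BERT-Base-Uncased vocabulary.

```python
# Probing findings 2 and 3 with the released tokenizer. Assumes the
# HuggingFace transformers library (pip install transformers); the
# example words are illustrative, and the exact splits depend on the
# BERT-Base-Uncased vocabulary.
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")

# Finding 2: an in-vocab word embedded inside an OOV word is broken
# into sub-word pieces, so its standalone embedding is lost; compare
# the two splits.
print(tok.tokenize("playing"))
print(tok.tokenize("snowplaying"))   # hypothetical OOV compound

# Finding 3: a minor misspelling can change the split entirely, so
# the resulting pieces carry little of the intended word's meaning.
print(tok.tokenize("language"))
print(tok.tokenize("langauge"))      # transposed characters
```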
