Workshop on Insights from Negative Results in NLP

Domain adaptation challenges of BERT in tokenization and sub-word representations of Out-of-Vocabulary words

Abstract

The BERT model (Devlin et al., 2019) has achieved significant progress on several Natural Language Processing (NLP) tasks by leveraging the multi-head self-attention mechanism (Vaswani et al., 2017) in its architecture. However, several research challenges remain that are not tackled well for the domain-specific corpora found in industry. In this paper, we highlight these problems through detailed experiments involving analysis of attention scores and dynamic word embeddings with the BERT-Base-Uncased model. Our experiments have led to interesting findings that showed: 1) the largest substring from the left that is found in the vocabulary (in-vocab) is always chosen at every sub-word unit, which can lead to suboptimal tokenization choices; 2) the semantic meaning of a vocabulary word deteriorates when it is found as a substring in an Out-of-Vocabulary (OOV) word; and 3) minor misspellings in words are inadequately handled. We believe that tackling these challenges would significantly help the domain adaptation aspect of BERT.
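
To make finding 1 concrete, here is a minimal sketch of the greedy longest-match-from-left procedure that WordPiece-style tokenizers use; the toy vocabulary is a hypothetical illustration chosen to expose the failure mode, not the released BERT-Base-Uncased vocabulary.

```python
# Minimal sketch of WordPiece-style greedy longest-match-from-left
# tokenization (finding 1). The toy vocabulary below is hypothetical;
# it is not the real BERT vocabulary.

def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """At each position, greedily take the longest in-vocab substring;
    continuation pieces are prefixed with '##' as in BERT."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # As in BERT's WordPiece, an unmatchable remainder maps
            # the whole word to the unknown token.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

# The greedy longest prefix "telep" strands the remainder "hone",
# collapsing the whole word to [UNK], even though the alternative
# split "tele" + "##phone" would have succeeded.
vocab = {"tele", "telep", "##phone"}
print(wordpiece_tokenize("telephone", vocab))  # -> ['[UNK]']
```

Because the match is committed greedily at each step and never revisited, the choice of "telep" is final, which is exactly the suboptimal tokenization behavior described above.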
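Findings 2 and 3 can be probed directly with the released tokenizer. The sketch below assumes the HuggingFace transformers library is installed; the example words are our own illustrative choices, and the exact sub-word splits depend on the BERT-Base-Uncased vocabulary.

```python
# Probing findings 2 and 3 with the released tokenizer. Assumes the
# HuggingFace transformers library (pip install transformers); the
# example words are illustrative, and the exact splits depend on the
# BERT-Base-Uncased vocabulary.
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")

# Finding 2: an in-vocab word embedded inside an OOV word is broken
# into sub-word pieces, so its standalone embedding is lost; compare
# the two splits.
print(tok.tokenize("playing"))
print(tok.tokenize("snowplaying"))   # hypothetical OOV compound

# Finding 3: a minor misspelling can change the split entirely, so
# the resulting pieces carry little of the intended word's meaning.
print(tok.tokenize("language"))
print(tok.tokenize("langauge"))      # transposed characters
```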
