Conference: International Conference on Language Resources and Evaluation

Jamo Pair Encoding: Subcharacter Representation-based Extreme Korean Vocabulary Compression for Efficient Subword Tokenization

Abstract

In the context of multilingual language model pre-training, vocabulary size for languages with a broad set of potential characters is an unsolved problem. We propose two algorithms, applicable to any unsupervised multilingual pre-training task, that increase the elasticity of the budget required to build the vocabulary in Byte-Pair Encoding-inspired tokenizers and significantly reduce the cost of supporting Korean in a multilingual model.
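
The abstract does not spell out the two algorithms, so the following is only a minimal sketch of the subcharacter idea named in the title, not the authors' method. It decomposes precomposed Hangul syllables into conjoining Jamo using the standard Unicode arithmetic, which suggests why a Jamo-level base alphabet is so much smaller than a syllable-level one: 11,172 possible syllable blocks versus 19 + 21 + 27 = 67 Jamo. The function name `to_jamo` and the constant names are hypothetical and chosen for illustration.

```python
import unicodedata

# Constants for the Hangul Syllables block, as defined by the Unicode Standard.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # T_COUNT includes "no final consonant"
S_COUNT = L_COUNT * V_COUNT * T_COUNT   # number of precomposed syllables


def to_jamo(text: str) -> str:
    """Split each precomposed Hangul syllable into its conjoining Jamo.

    Illustrative helper (assumption): characters outside the Hangul
    Syllables block are passed through unchanged.
    """
    out = []
    for ch in text:
        index = ord(ch) - S_BASE
        if 0 <= index < S_COUNT:
            lead, rest = divmod(index, V_COUNT * T_COUNT)
            vowel, tail = divmod(rest, T_COUNT)
            out.append(chr(L_BASE + lead))
            out.append(chr(V_BASE + vowel))
            if tail:  # tail == 0 means the syllable has no final consonant
                out.append(chr(T_BASE + tail))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    word = "한국어"
    jamo = to_jamo(word)
    print(len(word), "syllables ->", len(jamo), "jamo")
    # NFC normalization recomposes the Jamo sequence into syllable blocks.
    print(unicodedata.normalize("NFC", jamo) == word)
```

A BPE-style tokenizer trained on the Jamo sequence would then learn merges over this small alphabet instead of reserving a vocabulary slot for every syllable it encounters. How the paper actually orders or constrains those merges is not stated in the abstract.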