Jamo Pair Encoding: Subcharacter Representation-based Extreme Korean Vocabulary Compression for Efficient Subword Tokenization

Abstract

In the context of multilingual language model pre-training, vocabulary size for languages with a broad set of potential characters is an unsolved problem. We propose two algorithms applicable in any unsupervised multilingual pre-training task, increasing the elasticity of budget required for building the vocabulary in Byte-Pair Encoding inspired tokenizers, significantly reducing the cost of supporting Korean in a multilingual model.
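
As a rough illustration of what a sub-character (Jamo) representation means here, the sketch below splits precomposed Hangul syllables into their conjoining Jamo using the standard Unicode decomposition arithmetic. It is not the paper's algorithms, which this record does not describe; the function name `decompose` is an assumption made for the example. The point it shows is scale: the 11,172 precomposed syllables are built from 19 leading consonants, 21 vowels, and 27 trailing consonants, so a Jamo-level alphabet is far smaller than a syllable-level one.

```python
# Unicode Hangul syllable decomposition constants (from the Unicode Standard).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # T_COUNT includes the "no trailing consonant" case
S_COUNT = L_COUNT * V_COUNT * T_COUNT   # number of precomposed syllables


def decompose(text: str) -> str:
    """Replace each precomposed Hangul syllable with its conjoining Jamo.

    Hypothetical helper for illustration only; other characters pass through.
    """
    out = []
    for ch in text:
        idx = ord(ch) - S_BASE
        if 0 <= idx < S_COUNT:
            lead, rest = divmod(idx, V_COUNT * T_COUNT)
            vowel, tail = divmod(rest, T_COUNT)
            out.append(chr(L_BASE + lead))
            out.append(chr(V_BASE + vowel))
            if tail:  # tail index 0 means the syllable has no trailing consonant
                out.append(chr(T_BASE + tail))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    for syllable in "한국어":
        print(syllable, [f"U+{ord(j):04X}" for j in decompose(syllable)])
```

A BPE-style tokenizer run over the decomposed text would start from these few dozen Jamo symbols rather than from thousands of distinct syllable characters; how the paper actually builds on that is left to the original text.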

Bibliographic details

Source
International Conference on Language Resources and Evaluation | 2020 | pp. 3490-3497 | 8 pages
Authors
Sangwhan Moon; Naoaki Okazaki
Original format: PDF
Keywords
tokenization; vocabulary compaction; sub-character representations; out-of-vocabulary mitigation

Similar literature

1. Efficient match pair selection for oblique UAV images based on adaptive vocabulary tree [J]. ISPRS Journal of Photogrammetry and Remote Sensing. 2020, Issue Mara

2. Efficient WFST-Based One-Pass Decoding With On-The-Fly Hypothesis Rescoring in Extremely Large Vocabulary Continuous Speech Recognition [J]. Hori T., Hori C., Minami Y. IEEE Transactions on Audio, Speech, and Language Processing. 2007, Issue 4

3. Code compression of instruction ROM by byte pair encoding [J]. Atsushi Monzen, Hiroto Yasuura. 電子情報通信学会技術研究報告. VLSI設計技術. VLSI Design Technologies. 2001, Issue 45

4. Efficient data transfer scheme using word-pair-encoding-based compression for large-scale text-data processing [C]. Waidyasooriya Hasitha Muthumala, Ono Daisuke, Hariyama Masanori. IEEE Asia Pacific Conference on Circuits and Systems. 2014

5. SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning [O]. Xinhao Li, Denis Fourches. 2020
