Journal: ACM Transactions on Asian and Low-Resource Language Information Processing

Using Sub-character Level Information for Neural Machine Translation of Logographic Languages

Abstract

Logographic and alphabetic languages (e.g., Chinese vs. English) have linguistically different writing systems. Languages belonging to the same writing system usually share more information, which can be exploited to facilitate natural language processing tasks such as neural machine translation (NMT). This article takes advantage of the logographic characters in Chinese and Japanese by decomposing them into smaller units, thereby making better use of the information these characters share in training NMT systems, in both the encoding and decoding processes. Experiments show that the proposed method robustly improves NMT performance for both a "logographic" language pair (JA-ZH) and "logographic + alphabetic" language pairs (JA-EN and ZH-EN), in both supervised and unsupervised NMT scenarios. Moreover, as the decomposed sequences are usually very long, extra position features for the Transformer encoder help with modeling these long sequences. The results also indicate that, theoretically, linguistic features can be manipulated to obtain higher shared-token rates and further improve the performance of natural language processing systems.
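The following Python sketch illustrates the decomposition idea described in the abstract; it is not the authors' code. The names IDS_TABLE, decompose, and shared_token_rate are hypothetical, and the table holds only a few hand-picked decompositions as a stand-in for a full ideograph-decomposition resource.

```python
# Minimal, illustrative sketch (not the authors' implementation): decompose
# logographic characters into sub-character units so that Chinese and Japanese
# variants of a character can share tokens. IDS_TABLE is a tiny hypothetical
# stand-in for a full ideograph-decomposition resource.

IDS_TABLE = {
    "语": ["讠", "五", "口"],  # simplified Chinese "language"
    "語": ["言", "五", "口"],  # Japanese / traditional form of the same character
    "时": ["日", "寸"],        # simplified Chinese "time"
    "時": ["日", "土", "寸"],  # Japanese / traditional form
}

def decompose(text: str) -> list:
    """Replace each character with its sub-character components,
    falling back to the character itself when no decomposition is known."""
    units = []
    for ch in text:
        units.extend(IDS_TABLE.get(ch, [ch]))
    return units

def shared_token_rate(src_units, tgt_units) -> float:
    """Fraction of source-side unit types that also occur on the target side."""
    src_types, tgt_types = set(src_units), set(tgt_units)
    return len(src_types & tgt_types) / max(len(src_types), 1)

zh, ja = "语时", "語時"  # a Chinese string and its Japanese counterpart
print(shared_token_rate(list(zh), list(ja)))            # 0.0 at the character level
print(shared_token_rate(decompose(zh), decompose(ja)))  # 0.8 after decomposition
```

In a real pipeline the decomposed sequences would then go through subword segmentation before NMT training; because they are much longer than the original character sequences, the abstract notes that extra position features in the Transformer encoder help the model handle them.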