Journal: ACM Transactions on Asian and Low-Resource Language Information Processing

Using Sub-character Level Information for Neural Machine Translation of Logographic Languages

Abstract

Logographic and alphabetic languages (e.g., Chinese vs. English) have linguistically different writing systems. Languages belonging to the same writing system usually share more information, which can be exploited to facilitate natural language processing tasks such as neural machine translation (NMT). This article takes advantage of the logographic characters in Chinese and Japanese by decomposing them into smaller units, thereby making better use of the information these characters share in training NMT systems, in both the encoding and decoding processes. Experiments show that the proposed method robustly improves NMT performance for both a "logographic" language pair (JA-ZH) and "logographic + alphabetic" language pairs (JA-EN and ZH-EN), in both supervised and unsupervised NMT scenarios. Moreover, as the decomposed sequences are usually very long, extra position features for the Transformer encoder help with modeling these long sequences. The results also indicate that, theoretically, linguistic features can be manipulated to obtain higher shared-token rates and further improve the performance of natural language processing systems.
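The following Python sketch illustrates the decomposition idea described in the abstract; it is not the authors' code. The names IDS_TABLE, decompose, and shared_token_rate are hypothetical, and the table holds only a few hand-picked decompositions as a stand-in for a full ideograph-decomposition resource.

```python
# Minimal, illustrative sketch (not the authors' implementation): decompose
# logographic characters into sub-character units so that Chinese and Japanese
# variants of a character can share tokens. IDS_TABLE is a tiny hypothetical
# stand-in for a full ideograph-decomposition resource.

IDS_TABLE = {
    "语": ["讠", "五", "口"],  # simplified Chinese "language"
    "語": ["言", "五", "口"],  # Japanese / traditional form of the same character
    "时": ["日", "寸"],        # simplified Chinese "time"
    "時": ["日", "土", "寸"],  # Japanese / traditional form
}

def decompose(text: str) -> list:
    """Replace each character with its sub-character components,
    falling back to the character itself when no decomposition is known."""
    units = []
    for ch in text:
        units.extend(IDS_TABLE.get(ch, [ch]))
    return units

def shared_token_rate(src_units, tgt_units) -> float:
    """Fraction of source-side unit types that also occur on the target side."""
    src_types, tgt_types = set(src_units), set(tgt_units)
    return len(src_types & tgt_types) / max(len(src_types), 1)

zh, ja = "语时", "語時"  # a Chinese string and its Japanese counterpart
print(shared_token_rate(list(zh), list(ja)))            # 0.0 at the character level
print(shared_token_rate(decompose(zh), decompose(ja)))  # 0.8 after decomposition
```

In a real pipeline the decomposed sequences would then go through subword segmentation before NMT training; because they are much longer than the original character sequences, the abstract notes that extra position features in the Transformer encoder help the model handle them.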