Due to the scarcity of parallel training data for many language pairs, quasi-parallel or comparable training data is an important alternative resource for building machine translation systems for such pairs. Because comparable corpora are of lower quality than manually curated parallel data, using them for training can degrade the translation performance of an NMT model. We propose distillation as a remedy for effectively leveraging comparable data: a student model trained on the combined clean and comparable data is guided by a teacher model trained only on the high-quality clean data. Our experiments on Arabic-English, Chinese-English, and German-English translation demonstrate that distillation yields significant improvements over off-the-shelf use of comparable data and performs comparably to state-of-the-art noise-filtering methods.
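The abstract does not spell out the training objective; as a rough illustration of the kind of teacher guidance described above, the sketch below shows a standard word-level knowledge-distillation loss in PyTorch, where the student's cross-entropy on the reference tokens is interpolated with a KL term toward the frozen teacher's output distribution. The function name, the alpha/temperature parameters, and the padding handling are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, gold_ids,
                      alpha=0.5, temperature=1.0, pad_id=0):
    """Sketch of a word-level knowledge-distillation loss (assumed form).

    student_logits, teacher_logits: (batch, tgt_len, vocab)
    gold_ids: (batch, tgt_len) reference token ids
    """
    vocab = student_logits.size(-1)

    # Standard cross-entropy against the reference translation,
    # skipping padding positions.
    ce = F.cross_entropy(
        student_logits.view(-1, vocab),
        gold_ids.view(-1),
        ignore_index=pad_id,
    )

    # KL(teacher || student) on temperature-softened distributions.
    # The teacher is detached: it was trained on clean data only and stays fixed.
    t = temperature
    kl_per_token = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits.detach() / t, dim=-1),
        reduction="none",
    ).sum(-1) * (t * t)

    # Average the KL term over non-padding target positions only.
    mask = (gold_ids != pad_id).float()
    kl = (kl_per_token * mask).sum() / mask.sum().clamp(min=1.0)

    return alpha * ce + (1.0 - alpha) * kl
```

In this reading, the clean-data teacher acts as a soft reference on the noisier comparable examples, so the student can still learn from them without fully trusting their (possibly misaligned) target side.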