In this paper we study the impact of using images to machine-translate user-generated e-commerce product listings. We investigate how a multi-modal Neural Machine Translation (NMT) model compares to two text-only approaches: a conventional state-of-the-art attentional NMT model and a Statistical Machine Translation (SMT) model. User-generated product listings often do not constitute grammatical or well-formed sentences. More often than not, they consist of the juxtaposition of short phrases or keywords. We train our models end-to-end, and also use text-only and multi-modal NMT models to re-rank n-best lists generated by an SMT model. We qualitatively evaluate our user-generated training data and analyse how adding synthetic data impacts the results. We evaluate our models quantitatively using BLEU and TER and find that (i) additional synthetic data has a generally positive impact on text-only and multi-modal NMT models, and that (ii) using a multi-modal NMT model for re-ranking n-best lists improves TER significantly across different n-best list sizes.
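The re-ranking setup mentioned above can be illustrated with a minimal sketch. This is not the paper's actual model: the scores, the interpolation weight `alpha`, and the helper `nmt_score` are all hypothetical, and in practice the NMT score would come from a trained (multi-modal) model and `alpha` would be tuned on development data.

```python
def rerank(nbest, nmt_score, alpha=0.5):
    """Return candidates sorted by interpolated score (higher is better).

    nbest     -- list of (translation, smt_log_score) pairs from the SMT n-best list
    nmt_score -- callable mapping a translation to an NMT log-probability (assumed)
    alpha     -- interpolation weight between SMT and NMT scores (hypothetical)
    """
    rescored = [
        (cand, alpha * smt + (1 - alpha) * nmt_score(cand))
        for cand, smt in nbest
    ]
    return [cand for cand, _ in sorted(rescored, key=lambda x: x[1], reverse=True)]

# Toy usage with made-up scores: the NMT model prefers the fluent ordering,
# overturning the SMT model's original ranking.
nbest = [("red cotton shirt", -2.0), ("shirt red cotton", -1.5)]
toy_nmt = lambda t: -1.0 if t.startswith("red") else -3.0
print(rerank(nbest, nmt_score=toy_nmt)[0])  # -> "red cotton shirt"
```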