Bilingual parallel corpora, also known as bitexts, convey the same information in two different languages. Consequently, a model of a bitext can exploit the relation that exists between the two texts; the text alignment task establishes this relation. In this paper we propose different approaches that use words and biwords (pairs of words, each from a different text) as symbolic units of representation. The properties of these approaches are analysed from a statistical point of view and tested as a preprocessing step for general-purpose compressors. The results obtained suggest interesting conclusions concerning the use of both words and biwords. When the encoded models are used as compression boosters, we achieve compression ratios that improve on state-of-the-art compressors by up to 6.5 percentage points, while being up to 40% faster.
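To make the notion of a biword concrete, the following is a minimal sketch rather than the method proposed in the paper. It assumes that a one-to-one word alignment is already available as a list of index pairs. The function name `to_biwords`, the example sentences, and the handling of unaligned words are all hypothetical choices made for illustration; in particular, appending unaligned target words at the end discards their original positions, which a real encoder would have to record.

```python
def to_biwords(src_tokens, tgt_tokens, alignment):
    """Merge two aligned token sequences into one stream of symbols.

    alignment is a list of (src_index, tgt_index) pairs, assumed here to be
    one-to-one. Aligned words become biwords (src_word, tgt_word); words
    without a partner are kept as (word, None) or (None, word).
    """
    src_to_tgt = dict(alignment)
    aligned_tgt = {j for _, j in alignment}
    symbols = []
    # Walk the source text in order, pairing each word with its translation.
    for i, word in enumerate(src_tokens):
        if i in src_to_tgt:
            symbols.append((word, tgt_tokens[src_to_tgt[i]]))
        else:
            symbols.append((word, None))
    # Simplification: unaligned target words are appended at the end,
    # so their original positions are not preserved.
    for j, word in enumerate(tgt_tokens):
        if j not in aligned_tgt:
            symbols.append((None, word))
    return symbols


# Hypothetical example: an English and a Spanish phrase with a crossing alignment.
src = ["the", "green", "house"]
tgt = ["la", "casa", "verde"]
alignment = [(0, 0), (1, 2), (2, 1)]
symbols = to_biwords(src, tgt, alignment)
```

Under these assumptions, each distinct symbol could then be mapped to an integer identifier and the resulting sequence handed to a general-purpose compressor, which is the preprocessing role described above.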