Aligned parallel corpora have proved very useful in many natural language processing tasks, including statistical machine translation and word sense disambiguation. In this paper, we address two issues in current word alignment research: coverage and resource requirements. In addressing these issues, we discuss the central problems of data sparseness and noise in the knowledge acquisition process and suggest an approach based on a bilingual machine-readable dictionary (MRD). We describe an MRD-based word alignment method called GenusAlign, which relies on genus terms to cluster the dictionary entries of headwords and their translations. These genus-based clusters are especially effective for aligning suffixes that carry semantic features such as person, time, and tool. Although it does not require a very large bilingual corpus, the GenusAlign algorithm rivals corpus-based methods in both coverage and precision.