In a multilingual scenario, the classical monolingual text categorization (TC) problem can be reformulated as a cross language TC task, in which we have to cope with two or more languages (e.g. English and Italian). In this setting, the system is trained using labeled examples in a source language (e.g. English), and it classifies documents in a different target language (e.g. Italian). In this paper we propose a novel approach to solve the cross language text categorization problem based on acquiring Multilingual Domain Models from comparable corpora in a totally unsupervised way and without using any external knowledge source (e.g. bilingual dictionaries). These Multilingual Domain Models are exploited to define a generalized similarity function (i.e. a kernel function) among documents in different languages, which is used inside a Support Vector Machines classification framework. The results show that our approach is a feasible and cheap solution that largely outperforms a baseline.
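To make the idea of a kernel defined through a Multilingual Domain Model more concrete, the following is a minimal sketch and not the implementation used in the paper. It assumes that the domain model is obtained by a truncated SVD of a term-by-document matrix built from the unlabeled comparable corpus, and that terms shared across languages (e.g. proper nouns) link the two vocabularies. The toy vocabulary, the corpus counts, and the names `acquire_domain_model` and `domain_kernel` are all hypothetical.

```python
import numpy as np

# Toy vocabulary (hypothetical): English-only, Italian-only, and shared terms.
VOCAB = ["bank", "match", "banca", "partita", "euro", "juventus"]

# Unlabeled comparable corpus as a documents x terms count matrix.
CORPUS = np.array([
    [2, 0, 0, 0, 1, 0],  # English, finance
    [0, 2, 0, 0, 0, 1],  # English, sport
    [0, 0, 2, 0, 1, 0],  # Italian, finance
    [0, 0, 0, 2, 0, 1],  # Italian, sport
], dtype=float)


def acquire_domain_model(corpus, k):
    """Return a terms x k matrix mapping terms to k latent domains.

    Assumption: the domains are the top-k right singular vectors of the
    documents x terms matrix (an LSA-style choice made for this sketch).
    """
    _, _, vt = np.linalg.svd(corpus, full_matrices=False)
    return vt[:k].T


def domain_kernel(x, y, domain_model):
    """Cosine similarity between documents after projection into domain space."""
    xd = x @ domain_model
    yd = y @ domain_model
    # Small constant guards against division by zero for empty documents.
    xd = xd / (np.linalg.norm(xd, axis=1, keepdims=True) + 1e-12)
    yd = yd / (np.linalg.norm(yd, axis=1, keepdims=True) + 1e-12)
    return xd @ yd.T


D = acquire_domain_model(CORPUS, k=2)

english_train = np.array([[1, 0, 0, 0, 0, 0]], dtype=float)  # "bank"
italian_test = np.array([
    [0, 0, 1, 0, 0, 0],  # "banca"
    [0, 0, 0, 1, 0, 0],  # "partita"
], dtype=float)

# Rows: English documents; columns: Italian documents.
K = domain_kernel(english_train, italian_test, D)
```

In this toy setting the English document "bank" and the Italian document "banca" share no terms, so a plain bag-of-words cosine between them is zero, while the domain kernel places them close together and keeps "bank" and "partita" apart. A kernel matrix of this kind could then be supplied to a Support Vector Machine that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`.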