We consider cross- and multilingual text classification approaches to the identification of online registers (genres), i.e. text varieties with specific situational characteristics. Register is arguably the most important predictor of linguistic variation, and register information could improve the potential of online data for many applications. We introduce the Finnish Corpus of Online REgisters (FinCORE), the first manually annotated non-English corpus of online registers featuring the full range of linguistic variation found online. The data set consists of 2,237 Finnish documents and follows the register taxonomy developed for the Corpus of Online Registers of English (CORE), the largest manually annotated language collection of online registers. Using CORE and FinCORE data, we demonstrate the feasibility of cross-lingual register identification using a simple approach based on convolutional neural networks and multilingual word embeddings. We further find that register identification results can be improved through multilingual training even when a substantial number of annotations is available in the target language.