This study addresses the automatic identification of sociolinguistic characteristics of fictional characters in plays by analyzing their written "speech". We discuss three binary classification problems: predicting a character's gender (male vs. female), age (young vs. old), and socio-economic standing (upper-middle class vs. lower class). The corpus is an annotated collection of plays by August Strindberg and Henrik Ibsen, translated into English, all in the public domain. These playwrights were chosen for their known attention to relevant socio-economic issues in their work. Linguistic and textual cues are extracted from the characters' lines (turns) for modeling. We report on the dataset, and on the classification performance and most important features for each of the sociolinguistic characteristics, comparing intra-author and inter-author testing.