Government documents must be reviewed to identify any sensitive information they may contain before they can be released to the public. However, traditional paper-based sensitivity review processes are not practical for reviewing born-digital documents. Therefore, there is a timely need for automatic sensitivity classification techniques to assist the digital sensitivity review process. However, sensitivity is typically a product of the relations between combinations of terms, such as who said what about whom; automatic sensitivity classification is therefore a difficult task. Vector representations of terms, such as word embeddings, have been shown to be effective at encoding latent term features that preserve semantic relations between terms, which can also be beneficial to sensitivity classification. In this work, we present a thorough evaluation of the effectiveness of semantic word embedding features, along with term and grammatical features, for sensitivity classification. On a test collection of government documents containing real sensitivities, we show that extending text classification with semantic features and additional term n-grams results in significant improvements in classification effectiveness, correctly classifying 9.99% more sensitive documents compared to the text classification baseline.