Much of aviation safety reporting data is structured, e.g., digital flight data or radar data. However, safety report narratives, which come in the form of unstructured text, are indispensable for safety reporting. Structured data alone cannot capture all of the details of an incident, whereas narratives can and do represent a myriad of details in a form that is natural for analysts to work with. Large-scale analysis of narratives nevertheless comes with many challenges: 1) it is difficult to employ enough human experts to digest the continuous flow of new incident reports; 2) authors of incident reports use many different terms to refer to the same semantic concept, which makes it harder to determine whether a specific concept occurs in a text; 3) authors often make spelling mistakes; and 4) authors use a wide variety of abbreviations, some of which are nonstandard. These challenges can be mitigated by the intelligent use of Natural Language Processing (NLP) and Deep Learning techniques to automate parts of narrative processing. Specifically, we show how to use ensembles of word2vec models to automatically find semantically similar terms within safety report corpora, and how to combine human expertise with these ensemble models to identify sets of similar terms with greater recall than either method alone. We also show an unsupervised method for comparing several word2vec models trained on the same data in order to estimate reasonable ranges of vector sizes with which to induce individual word2vec models. This method is based on measuring inter-model agreement on the similar terms that the word2vec models have in common.
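The two ideas above, pooling similar terms across an ensemble and measuring inter-model agreement, can be illustrated with a minimal sketch. It is not the authors' implementation. The voting rule, the use of Jaccard overlap as the agreement measure, the function names, and the neighbor lists are all assumptions made for illustration. In practice each list would come from a trained word2vec model, for example via gensim's `most_similar`.

```python
from collections import Counter
from itertools import combinations


def ensemble_similar_terms(neighbor_lists, min_votes=2):
    """Return terms that at least `min_votes` models list as similar.

    Terms are ordered by vote count (descending), then alphabetically.
    The voting rule is an assumption, not the paper's stated method.
    """
    votes = Counter()
    for neighbors in neighbor_lists:
        votes.update(set(neighbors))  # each model votes once per term
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, count in ranked if count >= min_votes]


def mean_pairwise_agreement(neighbor_lists):
    """Average Jaccard overlap between every pair of models' neighbor sets.

    Jaccard overlap is an assumed stand-in for the agreement measure.
    """
    sets = [set(n) for n in neighbor_lists]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 1.0  # a single model trivially agrees with itself
    scores = [len(a & b) / len(a | b) if (a | b) else 1.0 for a, b in pairs]
    return sum(scores) / len(scores)


# Hypothetical top-4 neighbors of "runway" from three models
# trained on the same corpus (invented data, for illustration only).
model_outputs = [
    ["rwy", "taxiway", "runwy", "apron"],
    ["rwy", "runwy", "taxiway", "ramp"],
    ["rwy", "taxiway", "threshold", "runwy"],
]

candidates = ensemble_similar_terms(model_outputs, min_votes=2)
agreement = mean_pairwise_agreement(model_outputs)
print(candidates)
print(round(agreement, 3))
```

Under these assumptions, the candidate list would go to a human expert for review. The agreement score could be computed once per vector size, so that sizes at which independently trained models agree more can be compared without labeled data.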