To extract structured representations of newsworthy events from Twitter, unsupervised models typically assume that tweets involving the same named entities and expressed using similar words are likely to belong to the same event. Hence, they group tweets into clusters based on the co-occurrence patterns of named entities and topical keywords. However, these models have two main limitations. First, they require the number of events to be known beforehand, which is not realistic in practical applications. Second, they do not recognise that the same named entity might be referred to by multiple mentions, so tweets using different mentions would be wrongly assigned to different events. To overcome these limitations, we propose a non-parametric Bayesian mixture model with word embeddings for event extraction, in which the number of events can be inferred automatically and the issue of lexical variations for the same named entity can be dealt with properly. Our model has been evaluated on three datasets with sizes ranging between 2,499 and over 60 million tweets. Experimental results show that our model outperforms the baseline approach on all datasets by 5-8% in F-measure.