Chinese news texts often contain many abbreviations whose full forms are never explicitly defined. Expanding abbreviations to their original full forms therefore plays an important role in improving the accuracy of information extraction and retrieval systems for Chinese. In this paper, we present a hybrid approach to the automatic expansion of abbreviations in Chinese news texts. Chinese abbreviations are generally produced from their original full forms via reduction, elimination or generalization. To ensure that every abbreviation can be expanded, each abbreviation under expansion is assumed, in turn, to have been created by each of these three methods. Based on this assumption, a mapping table between shortened words and their matrix words, together with a dictionary of short-form/full-form pairs, is used to generate all possible expansions for an abbreviation. For an ambiguous abbreviation with multiple expansion candidates, hidden Markov models are employed to rank the candidates and select the one with the maximum score. To further improve expansion performance, linguistic knowledge such as discourse information and abbreviation patterns is used to correct possible expansion errors. Evaluation on an abbreviation-expanded corpus built from the Peking University Corpus showed that our approach achieves an average precision of 86.3% and an average recall of 83.8% across the various types of abbreviations in Chinese news texts.
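The generate-then-rank pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the mapping table, the pair dictionary and the probabilities are toy, hypothetical values, and a plain word-bigram score stands in for the hidden Markov models used in the paper. All function and variable names are assumptions made for illustration.

```python
import math
from itertools import product

# Hypothetical mapping table: shortened character -> possible matrix words.
CHAR_TO_WORDS = {
    "北": ["北京"],
    "大": ["大学", "大会"],
}

# Hypothetical dictionary of short-form/full-form pairs.
PAIR_DICT = {
    "北大": ["北京大学"],
}

# Toy bigram log-probabilities (assumed values, not estimated from a corpus).
BIGRAM_LOGP = {
    ("<s>", "北京"): math.log(0.5),
    ("北京", "大学"): math.log(0.6),
    ("北京", "大会"): math.log(0.1),
}
FLOOR = math.log(1e-6)  # back-off score for unseen bigrams


def generate_candidates(abbr):
    """Return candidate expansions as tuples of words."""
    candidates = []
    # Dictionary lookup: the whole abbreviation maps to a full form.
    for full in PAIR_DICT.get(abbr, []):
        candidates.append((full,))
    # Reduction: each character is expanded to one of its matrix words.
    options = [CHAR_TO_WORDS.get(ch) for ch in abbr]
    if abbr and all(options):
        for combo in product(*options):
            if combo not in candidates:
                candidates.append(combo)
    return candidates


def score(words):
    """Sum bigram log-probabilities over the word sequence."""
    total, prev = 0.0, "<s>"
    for word in words:
        total += BIGRAM_LOGP.get((prev, word), FLOOR)
        prev = word
    return total


def expand(abbr):
    """Pick the highest-scoring candidate, or return the input unchanged."""
    candidates = generate_candidates(abbr)
    if not candidates:
        return abbr
    return "".join(max(candidates, key=score))


if __name__ == "__main__":
    print(expand("北大"))
```

In this sketch an abbreviation with no candidates is returned unchanged; the error-correction step based on discourse information and abbreviation patterns is not modelled.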