Chinese abbreviations are widely used in modern Chinese texts. They are a special form of unknown words, and many of them are named entities, which makes correct Chinese processing difficult. In this study, the Chinese abbreviation problem is regarded as an error recovery problem in which the suspect root words are the "errors" to be recovered from a set of candidates. The problem is mapped to an HMM-based generation model for both abbreviation identification and root word recovery, and the model is integrated into a unified word segmentation model when the input extends to a complete sentence. Two major experiments are conducted to test the abbreviation models. In the first experiment, an attempt is made to guess the abbreviations of the root words; an accuracy rate of 72% is observed. In the second experiment, by contrast, an attempt is made to guess the root words from the abbreviations. Some submodels achieve accuracy as high as 51% with the simple HMM-based model. Some quantitative observations against heuristic knowledge about Chinese abbreviations are also reported.
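
To make the root word recovery idea concrete, the following is a minimal sketch and not the implementation used in the study. It assumes that each abbreviation character is an observed output and that the root words are the hidden states of a first-order HMM, so that recovery becomes a Viterbi search over the candidate root words. The function name `viterbi`, the candidate table, and all probability values are hypothetical and invented purely for illustration.

```python
import math

def viterbi(observations, candidates, log_init, log_trans, log_emit):
    """Return the most probable hidden root-word sequence for the observed characters."""
    floor = math.log(1e-12)  # crude floor for unseen events (an assumption of this sketch)
    # best[i][root] = (log score of the best path ending in root, previous root)
    best = [{}]
    first = observations[0]
    for root in candidates[first]:
        score = log_init.get(root, floor) + log_emit.get((first, root), floor)
        best[0][root] = (score, None)
    for i in range(1, len(observations)):
        obs = observations[i]
        column = {}
        for root in candidates[obs]:
            # Pick the predecessor that maximizes path score plus transition score.
            prev, score = max(
                ((p, s + log_trans.get((p, root), floor))
                 for p, (s, _) in best[i - 1].items()),
                key=lambda item: item[1],
            )
            column[root] = (score + log_emit.get((obs, root), floor), prev)
        best.append(column)
    # Backtrace from the best final state.
    state = max(best[-1], key=lambda r: best[-1][r][0])
    path = [state]
    for i in range(len(observations) - 1, 0, -1):
        state = best[i][state][1]
        path.append(state)
    return path[::-1]

def to_log(table):
    """Convert a table of probabilities to log probabilities."""
    return {key: math.log(value) for key, value in table.items()}

# Toy model: every number below is made up for illustration only.
candidates = {"台": ["台灣", "台北"], "大": ["大學", "大樓"]}
log_init = to_log({"台灣": 0.6, "台北": 0.4})
log_trans = to_log({
    ("台灣", "大學"): 0.7, ("台灣", "大樓"): 0.3,
    ("台北", "大學"): 0.4, ("台北", "大樓"): 0.6,
})
# Emission: P(abbreviation character | root word).
log_emit = to_log({
    ("台", "台灣"): 0.5, ("台", "台北"): 0.5,
    ("大", "大學"): 0.6, ("大", "大樓"): 0.2,
})

result = viterbi(["台", "大"], candidates, log_init, log_trans, log_emit)
print(result)  # ['台灣', '大學']
```

With these toy values, the abbreviation 台大 is recovered as 台灣 followed by 大學. In a real system the candidate sets and probabilities would have to be estimated from an abbreviation dictionary or corpus, and the abstract does not specify how that is done.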