Word segmentation is an essential step in Chinese information processing. Although related research has been reported and has made progress, the unknown named entity (UNE) problem in segmentation has not been fully solved, and it degrades overall segmentation accuracy. This paper presents a model that identifies UNEs in order to improve the overall performance of segmentation. To capture named entity (NE) information, the functions of characters or words are encoded as tags. In addition, useful surrounding contexts are collected from a corpus and used as features. The model is built on Maximum Entropy and treats UNE identification as a tagging problem. Experiments show that overall segmentation accuracy improves after the UNE identification module is integrated into the word segmenter.
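As a rough illustration of treating UNE identification as a tagging problem, the sketch below converts labeled words into per-character tags and collects a window of surrounding characters as features. The abstract does not specify the tag set or the feature templates, so the B/I/E/S position scheme, the entity labels (`PER`, `LOC`, `O`), the window size, and the function names `tag_characters` and `context_features` are all assumptions made for this example rather than the authors' design.

```python
from typing import Dict, List, Tuple


def tag_characters(words: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Turn (word, label) pairs into per-character (char, tag) pairs.

    Assumed scheme: B/I/E mark the begin/inside/end of a multi-character
    word, S marks a single-character word; the label ("PER", "LOC", "O", ...)
    is a hypothetical entity type appended to the position marker.
    """
    tagged = []
    for word, label in words:
        n = len(word)
        for i, ch in enumerate(word):
            if n == 1:
                pos = "S"
            elif i == 0:
                pos = "B"
            elif i == n - 1:
                pos = "E"
            else:
                pos = "I"
            tagged.append((ch, f"{pos}-{label}"))
    return tagged


def context_features(chars: List[str], i: int, window: int = 2) -> Dict[str, str]:
    """Collect surrounding characters as features for position i.

    Assumes window >= 1. Positions outside the sentence are padded.
    """
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        feats[f"c[{offset}]"] = chars[j] if 0 <= j < len(chars) else "<PAD>"
    # Character bigrams around the current position (assumed templates).
    feats["c[-1]c[0]"] = feats["c[-1]"] + feats["c[0]"]
    feats["c[0]c[1]"] = feats["c[0]"] + feats["c[1]"]
    return feats


# Invented example sentence, not taken from the paper's corpus.
sentence = [("王小明", "PER"), ("在", "O"), ("北京", "LOC"), ("工作", "O")]
tagged = tag_characters(sentence)
chars = [ch for ch, _ in tagged]
features = [context_features(chars, i) for i in range(len(chars))]
```

Each `(features[i], tag)` pair could then serve as one training instance for a Maximum Entropy classifier (equivalently, multinomial logistic regression over indicator features), with the predicted tag sequence decoded back into word and entity boundaries.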