In this paper, we present a series of algorithms and tools for processing large text corpora to build high-performance statistical language models. Our goal is that, given a raw corpus as input, accurate and robust topic-dependent language models can be obtained automatically. All the tools are based on three core technologies that we developed: tree-structured lexicons, fuzzy training subsets, and neural-network-based detection of topic changes in text.