Disclosed is a method and apparatus for building a domain-specific language model for use in language processing applications, e.g., speech recognition. A reference language model is generated based on a relatively small seed corpus containing linguistic units relevant to the domain. An external corpus containing a large number of linguistic units is accessed. Using the reference language model, linguistic units which have a sufficient degree of relevance to the domain are extracted from the external corpus. The reference language model is then updated based on the seed corpus and the extracted linguistic units. The process may be repeated iteratively until the language model is of satisfactory quality. The language building technique may be further enhanced by combining it with mixture modeling or class-based modeling.
展开▼