AUTOMATIC LANGUAGE IDENTIFICATION SYSTEM FOR MULTILINGUAL OPTICAL CHARACTER RECOGNITION
Abstract
1. A method for automatically determining one or more languages associated with text in a document, comprising the steps of: segmenting the document into a plurality of word tokens; forming at least one hypothesis of the characters in said word tokens; defining a dictionary for each one of plural languages; determining confidence factors with respect to said plural languages for said word hypotheses, which factors are based on whether the dictionary for a given language indicates whether a word hypothesis is found in that language; defining a plurality of regions in the document, each of which contains at least one word; determining a language confidence factor for each region, based upon the confidence factors associated with the words in the region; and clustering regions which have relatively high confidence factors for a given language to form a subzone that is identified with the given language.
2. The method of claim 1 wherein a hypothesis is formed only for words having a minimum length of at least two characters.
3. The method of claim 1 wherein said confidence factors for hypothesized words are weighted in accordance with the lengths of the hypothesized words.
4. The method of claim 1 further including the steps of determining a recognition probability for each hypothesis, and weighting said confidence factors in accordance with the recognition probabilities.
5. The method of claim 1 wherein said confidence factors for hypothesized words are weighted in accordance with the frequencies with which the hypothesized words appear in the respective languages.
6. The method of claim 1 wherein said initial hypothesis is formed by means of a classifier that is generic to each of said plural languages.
7. A method for automatically segmenting a document into homogenous language subzones, comprising the steps of: defining at least one zone in the document which contains a plurality of words; defining a dictionary for each one of plural languages; for each word in the zone, determining a confidence factor with respect to each of said plural languages, which factor is based on whether the respective dictionaries contain the word; identifying a zone language for the zone, based upon the confidence factors associated with the words in the zone; selecting a local region in the zone which contains at least one word; identifying a region language for the local region, based upon the confidence factor associated with the words in the region; determining whether the region language is the same as the zone language; and segregating the local region from other regions in the zone if its region language is not the same as the zone language.
8. A method for automatically determining one or more languages associated with text in a document, comprising the steps of: segmenting the document into a plurality of zones containing regions of word tokens; forming at least one hypothesis of the characters in said word tokens; defining a dictionary for each one of plural languages; for each hypothesized word, determining which ones of said dictionaries contain the hypothesis for the word and determining a confidence value for each language; identifying a zone language for each zone, based upon the confidence values associated with the words in the zone; identifying a region language for each region, based upon the confidence values associated with the words in the region; designating the zone language as the region language if the confidence values associated with the words in the region are not sufficiently high; and clustering regions in a zone which have the same region language to form a subzone that is identified with a particular language.
9. The method of claim 8 wherein a hypothesis is formed only for words having a predetermined minimum number of characters greater than one.
10. The method of claim 8 further including the step of weighting said confidence values in accordance with the lengths of the hypothesized words.
11. The method of claim 8 further including the steps of determining a recognition probability for each hypothesis, and weighting said confidence values in accordance with the recognition probabilities.
12. The method of claim 8 wherein said initial hypothesis is formed by means of a classifier that is generic to each of said plural languages.
13. A method for automatically determining one or more languages associated with text in a document, comprising the steps of: segmenting the document into a plurality of word tokens; forming at least one hypothesis of the characters in said word tokens; for each word hypothesis, determining a confidence factor that indicates whether the word is contained in each of said plural languages; defining a plurality of regions in the document, each of which contains at least one word; determining a language confidence factor for each region, based upon the confidence factors associated with the words in the region; and clustering regions which have relatively high confidence factors for a given language to form a subzone that is identified with the given language.
14. A method for automatically segmenting a document into homogenous language subzones, comprising the steps of: defining at least one zone in the document which contains a plurality of words; for each word in the zone, determining a confidence factor that indicates whether the word is contained in each of said plural languages; identifying a zone language for the zone, based upon the confidence factors associated with the words in the zone; selecting a local region in the zone which contains at least one word; identifying a region language for the local region, based upon the confidence factor associated with the words in the region; determining whether the region language is the same as the zone language; and segregating the local region from other regions in the zone if its region language is not the same as the zone language.
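The steps recited in claims 1 to 4 and 8 can be illustrated with a minimal Python sketch. It is not the patented implementation. The toy dictionaries, the function names, the length-times-probability weighting, the threshold `min_region_conf`, and the choice to merge only adjacent regions are all assumptions made for illustration. Frequency weighting (claim 5) and the generic classifier (claims 6 and 12) are omitted, and word hypotheses are taken as already-recognized strings.

```python
from collections import defaultdict

# Hypothetical toy dictionaries; a real system would load full word lists.
DICTIONARIES = {
    "en": {"the", "cat", "sat", "on", "mat", "and", "is"},
    "fr": {"le", "chat", "est", "sur", "la", "table", "et"},
}
MIN_WORD_LENGTH = 2  # claims 2 and 9: ignore words shorter than two characters


def word_confidences(word, recognition_prob=1.0):
    """Return a confidence per language whose dictionary contains the word.

    Assumed weighting: word length (claim 3) times recognition
    probability (claim 4).
    """
    word = word.lower()
    if len(word) < MIN_WORD_LENGTH:
        return {}
    weight = len(word) * recognition_prob
    return {lang: weight for lang, words in DICTIONARIES.items() if word in words}


def region_confidence(words):
    """Sum the word confidences per language over a list of words."""
    totals = defaultdict(float)
    for word in words:
        for lang, conf in word_confidences(word).items():
            totals[lang] += conf
    return dict(totals)


def best_language(conf, threshold=0.0):
    """Pick the highest-scoring language, or None if it does not exceed threshold."""
    if not conf:
        return None
    lang = max(conf, key=conf.get)
    return lang if conf[lang] > threshold else None


def identify_subzones(regions, min_region_conf=3.0):
    """Label each region of one zone and cluster adjacent same-language regions.

    A region whose best score is not above min_region_conf inherits the
    zone language (the fallback described in claim 8).
    """
    zone_conf = region_confidence([w for region in regions for w in region])
    zone_lang = best_language(zone_conf)
    subzones = []  # list of (language, [region indices])
    for idx, region in enumerate(regions):
        lang = best_language(region_confidence(region), min_region_conf) or zone_lang
        if subzones and subzones[-1][0] == lang:
            subzones[-1][1].append(idx)  # extend the current subzone
        else:
            subzones.append((lang, [idx]))  # start a new subzone
    return zone_lang, subzones
```

Calling `identify_subzones` on a zone whose regions are lists of words returns the zone language together with the runs of region indices assigned to each language. Summing weighted dictionary hits is only one plausible reading of "confidence factor"; the claims do not fix a formula.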