AUTOMATIC LANGUAGE IDENTIFICATION SYSTEM FOR MULTILINGUAL OPTICAL CHARACTER RECOGNITION

Abstract

1. A method for automatically determining one or more languages associated with text in a document, comprising the steps of: segmenting the document into a plurality of word tokens; forming at least one hypothesis of the characters in said word tokens; defining a dictionary for each one of plural languages; determining confidence factors with respect to said plural languages for said word hypotheses, which factors are based on whether the dictionary for a given language indicates whether a word hypothesis is found in that language; defining a plurality of regions in the document, each of which contains at least one word; determining a language confidence factor for each region, based upon the confidence factors associated with the words in the region; and clustering regions which have relatively high confidence factors for a given language to form a subzone that is identified with the given language.

2. The method of claim 1 wherein a hypothesis is formed only for words having a minimum length of at least two characters.

3. The method of claim 1 wherein said confidence factors for hypothesized words are weighted in accordance with the lengths of the hypothesized words.

4. The method of claim 1 further including the steps of determining a recognition probability for each hypothesis, and weighting said confidence factors in accordance with the recognition probabilities.

5. The method of claim 1 wherein said confidence factors for hypothesized words are weighted in accordance with the frequencies with which the hypothesized words appear in the respective languages.

6. The method of claim 1 wherein said initial hypothesis is formed by means of a classifier that is generic to each of said plural languages.

7. A method for automatically segmenting a document into homogenous language subzones, comprising the steps of: defining at least one zone in the document which contains a plurality of words; defining a dictionary for each one of plural languages; for each word in the zone, determining a confidence factor with respect to each of said plural languages, which factor is based on whether the respective dictionaries contain the word; identifying a zone language for the zone, based upon the confidence factors associated with the words in the zone; selecting a local region in the zone which contains at least one word; identifying a region language for the local region, based upon the confidence factor associated with the words in the region; determining whether the region language is the same as the zone language; and segregating the local region from other regions in the zone if its region language is not the same as the zone language.

8. A method for automatically determining one or more languages associated with text in a document, comprising the steps of: segmenting the document into a plurality of zones containing regions of word tokens; forming at least one hypothesis of the characters in said word tokens; defining a dictionary for each one of plural languages; for each hypothesized word, determining which ones of said dictionaries contain the hypothesis for the word and determining a confidence value for each language; identifying a zone language for each zone, based upon the confidence values associated with the words in the zone; identifying a region language for each region, based upon the confidence values associated with the words in the region; designating the zone language as the region language if the confidence values associated with the words in the region are not sufficiently high; and clustering regions in a zone which have the same region language to form a subzone that is identified with a particular language.

9. The method of claim 8 wherein a hypothesis is formed only for words having a predetermined minimum number of characters greater than one.

10. The method of claim 8 further including the step of weighting said confidence values in accordance with the lengths of the hypothesized words.

11. The method of claim 8 further including the steps of determining a recognition probability for each hypothesis, and weighting said confidence values in accordance with the recognition probabilities.

12. The method of claim 8 wherein said initial hypothesis is formed by means of a classifier that is generic to each of said plural languages.

13. A method for automatically determining one or more languages associated with text in a document, comprising the steps of: segmenting the document into a plurality of word tokens; forming at least one hypothesis of the characters in said word tokens; for each word hypothesis, determining a confidence factor that indicates whether the word is contained in each of said plural languages; defining a plurality of regions in the document, each of which contains at least one word; determining a language confidence factor for each region, based upon the confidence factors associated with the words in the region; and clustering regions which have relatively high confidence factors for a given language to form a subzone that is identified with the given language.

14. A method for automatically segmenting a document into homogenous language subzones, comprising the steps of: defining at least one zone in the document which contains a plurality of words; for each word in the zone, determining a confidence factor that indicates whether the word is contained in each of said plural languages; identifying a zone language for the zone, based upon the confidence factors associated with the words in the zone; selecting a local region in the zone which contains at least one word; identifying a region language for the local region, based upon the confidence factor associated with the words in the region; determining whether the region language is the same as the zone language; and segregating the local region from other regions in the zone if its region language is not the same as the zone language.
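
The steps of claims 1 and 13, together with the weighting of claims 2 to 4, can be illustrated with a minimal Python sketch. It is not the patented implementation. The toy dictionaries, the `MIN_LENGTH` constant, the multiplicative weighting, the `threshold` value, and every function name are assumptions introduced here for illustration. The claims also do not say how regions are clustered; this sketch assumes regions arrive in reading order and simply merges consecutive regions that share a language.

```python
from collections import defaultdict
from itertools import groupby

# Toy dictionaries (assumption): a real system would load full word lists.
DICTIONARIES = {
    "en": {"the", "language", "of", "document", "is"},
    "fr": {"le", "langue", "de", "document", "est"},
}
MIN_LENGTH = 2  # claim 2: ignore words shorter than two characters


def word_confidences(word, recognition_prob=1.0):
    """Per-language confidence for one word hypothesis."""
    word = word.lower()
    if len(word) < MIN_LENGTH:
        return {}
    # Assumed weighting: word length (claim 3) times recognition
    # probability (claim 4). The claims do not fix a formula.
    weight = len(word) * recognition_prob
    return {lang: weight for lang, vocab in DICTIONARIES.items() if word in vocab}


def region_confidences(region):
    """Sum word confidences over a region given as (word, prob) pairs."""
    totals = defaultdict(float)
    for word, prob in region:
        for lang, conf in word_confidences(word, prob).items():
            totals[lang] += conf
    return dict(totals)


def best_language(region, threshold=3.0):
    """Top language if its confidence reaches the threshold, else None."""
    totals = region_confidences(region)
    if not totals:
        return None
    lang = max(totals, key=totals.get)
    return lang if totals[lang] >= threshold else None


def cluster_subzones(regions, threshold=3.0):
    """Merge consecutive regions with the same language into subzones.

    Returns a list of (language, [region indices]) pairs. Regions without
    a confident language are grouped under None.
    """
    labels = [best_language(region, threshold) for region in regions]
    subzones, index = [], 0
    for lang, run in groupby(labels):
        count = len(list(run))
        subzones.append((lang, list(range(index, index + count))))
        index += count
    return subzones


regions = [
    [("the", 0.9), ("language", 0.8)],
    [("of", 1.0), ("document", 0.9)],
    [("le", 0.9), ("langue", 0.9)],
]
print(cluster_subzones(regions))
```

In the second region, "document" appears in both toy dictionaries, so the shorter word "of" is what tips the region toward English. This is the kind of ambiguity the per-region aggregation is meant to resolve.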
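
Claims 7 and 14 describe a top-down variant: pick a language for the whole zone, then separate any local region whose language differs. Claim 8 adds a fallback to the zone language when a region's own evidence is weak. The sketch below is one possible reading, not the patent's method. It scores a language by counting dictionary hits, and the `min_score` cutoff, the toy dictionaries, and the function names are all hypothetical.

```python
def language_scores(words, dictionaries):
    """Count, per language, how many of the words its dictionary contains."""
    return {
        lang: sum(1 for w in words if w.lower() in vocab)
        for lang, vocab in dictionaries.items()
    }


def pick_language(words, dictionaries, min_score=1):
    """Best-scoring language, or None if no score reaches min_score."""
    scores = language_scores(words, dictionaries)
    if not scores:
        return None
    lang = max(scores, key=scores.get)
    return lang if scores[lang] >= min_score else None


def split_zone(zone, dictionaries, min_score=1):
    """Label each region and separate those that differ from the zone.

    `zone` is a list of regions; each region is a list of words.
    Returns (zone_language, regions_kept, segregated_regions).
    """
    all_words = [w for region in zone for w in region]
    zone_lang = pick_language(all_words, dictionaries, min_score)
    kept, segregated = [], []
    for region in zone:
        # Claim 8 fallback: weak region evidence inherits the zone language.
        region_lang = pick_language(region, dictionaries, min_score) or zone_lang
        if region_lang == zone_lang:
            kept.append(region)
        else:
            segregated.append((region_lang, region))
    return zone_lang, kept, segregated


# Toy data (assumption), purely for illustration.
dictionaries = {
    "en": {"the", "cat", "sat", "on", "mat"},
    "de": {"der", "hund", "und", "die", "katze"},
}
zone = [["the", "cat", "sat"], ["on", "the", "mat"], ["der", "hund"], ["xyz"]]
print(split_zone(zone, dictionaries))
```

The region `["xyz"]` matches no dictionary, so it inherits the zone language rather than being split off. Only the German region is segregated.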
