
METHOD, DEVICE AND COMPUTER SOFTWARE PRODUCT FOR FLEXIBLE LANGUAGE IDENTIFICATION ON THE TEXT BASIS


Abstract

1. A method for text-based language identification, comprising:
receiving an entry in a computer-readable text format;
determining an alphabet indicator of the entry for each of a plurality of languages;
determining an n-gram frequency indicator of the entry for each of the plurality of languages; and
determining, by means of a processor, a language associated with the entry based on a combination of the alphabet indicator and the n-gram frequency indicator.

2. The method according to claim 1, wherein determining the alphabet indicator comprises comparing the characters of the entry with the alphabet of each of the plurality of languages and generating an indicator for each of the plurality of languages, the indicator for each language being based at least in part on the absence of one or more characters of the entry from the alphabet of that language.

3. The method according to claim 1 or 2, wherein determining the n-gram frequency indicator for each of the plurality of languages comprises comparing the entry with n-gram statistics for each of the plurality of languages.

4. The method according to claim 3, wherein the entry comprises n characters, and comparing the entry with the n-gram statistics comprises determining the conditional probability of the n-th character of the entry given the preceding n-1 characters.

5. The method according to claim 3, further comprising assigning a start character and an end character to the first and last characters of the entry, respectively, for matching against the corresponding start and end characters associated with each n-gram probability in the n-gram statistics.

6. The method according to claim 1, further comprising comparing indicator a
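The steps recited in claims 1 to 5 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the `^` and `$` boundary markers, the add-one smoothing, the toy training words, and in particular the rule for combining the two indicators (alphabet indicator multiplied by the geometric-mean n-gram probability) are all assumptions made for illustration; the claims state only that the two indicators are combined.

```python
import math
from collections import Counter

START, END = "^", "$"  # hypothetical boundary markers (cf. claim 5)

def alphabet_indicator(entry, alphabet):
    """Fraction of the entry's letters found in the alphabet (cf. claim 2)."""
    letters = [c for c in entry.lower() if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c in alphabet for c in letters) / len(letters)

def train_ngrams(words, n=3):
    """Count n-grams and their (n-1)-character contexts over padded words."""
    ngrams, contexts = Counter(), Counter()
    for word in words:
        padded = START * (n - 1) + word.lower() + END
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts

def ngram_indicator(entry, model, n=3, vocab_size=64):
    """Mean log P(char | previous n-1 chars), add-one smoothed (cf. claim 4).

    vocab_size is an assumed smoothing constant, not a value from the source.
    """
    ngrams, contexts = model
    padded = START * (n - 1) + entry.lower() + END
    log_probs = []
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + vocab_size)
        log_probs.append(math.log(p))
    return sum(log_probs) / len(log_probs)

def identify_language(entry, alphabets, models, n=3):
    """Return (best language, per-language combined indicators)."""
    combined = {}
    for lang, alphabet in alphabets.items():
        a = alphabet_indicator(entry, alphabet)
        g = ngram_indicator(entry, models[lang], n)
        # Assumed combination rule: alphabet indicator times the
        # geometric-mean n-gram probability.
        combined[lang] = a * math.exp(g)
    return max(combined, key=combined.get), combined

# Toy data, for illustration only.
LATIN = set("abcdefghijklmnopqrstuvwxyz")
ALPHABETS = {"en": LATIN, "fi": LATIN | set("åäö")}
MODELS = {
    "en": train_ngrams(["the", "house", "water", "night", "thing"]),
    "fi": train_ngrams(["yö", "talo", "vesi", "päivä", "hyvää"]),
}

if __name__ == "__main__":
    for entry in ("päivä", "the"):
        best, _ = identify_language(entry, ALPHABETS, MODELS)
        print(entry, best)
```

With this toy data, an entry containing `ä` is penalized for English by the alphabet indicator alone, while an entry spelled entirely in shared Latin letters is separated only by the n-gram statistics. This is the reason for using both indicators together.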
