Sample text based language identification method and computer system

Abstract

The predominant language of a sample text is automatically identified using probability data that include N-gram probability data for at least one language and word probability data for at least one language. The N-gram probability data of a language indicate, for each N-gram, the probability that it occurs if the language is predominant. Similarly, the word probability data of a language indicate, for each word, the probability that it occurs if the language is predominant. The probability data are used to automatically obtain sample probability data for at least two languages. The sample probability data include N-gram probability information for at least one language and word probability information for at least one language. The sample probability data are used to automatically obtain language identifying data identifying the language whose sample probability data indicate the highest probability. The N-grams can be trigrams, while the words can be short words of no more than five characters. Some languages can have both trigram and word probabilities, while some can have only trigram probabilities.
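A minimal Python sketch of the scheme the abstract describes, under stated assumptions: each language has trigram probabilities and, optionally, probabilities for short words of at most five characters; a sample is scored against each language, and the language with the highest score is reported. The abstract does not say how the two kinds of probability are combined or how unseen items are handled, so the averaging of mean log-probabilities, the `FLOOR` constant, the toy training sentences, and all function names here are illustrative assumptions rather than the patented method.

```python
import math
from collections import Counter

# Assumed probability for trigrams/words never seen in training (not from the source).
FLOOR = 1e-6


def trigrams(text):
    """Return character trigrams of the lowercased, whitespace-normalized text."""
    padded = f" {' '.join(text.lower().split())} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def short_words(text, max_len=5):
    """Return the words of at most max_len characters."""
    return [w for w in text.lower().split() if len(w) <= max_len]


def to_probs(items):
    """Turn a list of items into relative-frequency probabilities."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: n / total for item, n in counts.items()}


def train(corpus, use_words=True):
    """Build probability data for one language; word data is optional."""
    model = {"trigram": to_probs(trigrams(corpus))}
    if use_words:
        model["word"] = to_probs(short_words(corpus))
    return model


def mean_log_prob(items, probs):
    """Average log-probability of the items, or None if there are no items."""
    if not items:
        return None
    return sum(math.log(probs.get(x, FLOOR)) for x in items) / len(items)


def score(sample, model):
    """Combine whichever components the language has (assumed: plain average)."""
    parts = [mean_log_prob(trigrams(sample), model["trigram"])]
    if "word" in model:
        parts.append(mean_log_prob(short_words(sample), model["word"]))
    parts = [p for p in parts if p is not None]
    return sum(parts) / len(parts) if parts else float("-inf")


def identify(sample, models):
    """Return the language whose model gives the sample the highest score."""
    return max(models, key=lambda lang: score(sample, models[lang]))


# Toy training data, far too small for real use; German has trigram data only.
MODELS = {
    "english": train("the quick brown fox jumps over the lazy dog "
                     "and the cat sat on the mat with the hat"),
    "french": train("le renard brun saute par dessus le chien paresseux "
                    "et le chat est sur le tapis avec le chapeau"),
    "german": train("der schnelle braune fuchs springt auf den faulen hund "
                    "und die katze sitzt auf der matte", use_words=False),
}

print(identify("the dog and the cat", MODELS))
```

Averaging per-item log-probabilities, rather than summing them, is one way to keep a language that has both trigram and word data comparable with one that has trigram data only; a real system would train on much larger corpora and might weight the two components differently.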
