Method of identifying the language of a textual passage using short word and/or n-gram comparisons

Abstract

A method and system for identifying the language of a textual passage are disclosed. The method and system include parsing the textual passage into n-grams, assigning an initial weight to each n-gram, and adjusting the weight initially assigned to a word or n-gram parsed from the textual passage. The initially assigned weight is adjusted in proportion to the inverse of the number of languages in which the word or n-gram appears. Reducing the weight assigned to such words or n-grams diminishes, without completely eliminating, their importance relative to other words or n-grams parsed from the same textual passage when determining the language of the passage. The method and system of the present invention thus appropriately weight the short words or n-grams common to multiple languages without affecting the short words or n-grams that are not common to several languages.
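The weighting idea in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the whitespace tokenizer, the initial weight of 1.0, and the tiny `TOY_PROFILES` word lists are all assumptions made for illustration. The only part taken from the abstract is the adjustment rule, in which a token's weight is divided by the number of languages it appears in.

```python
from collections import Counter

# Hypothetical toy profiles: a few short words per language (illustrative only).
TOY_PROFILES = {
    "english": {"the", "in", "of", "and", "a"},
    "german": {"der", "in", "und", "die"},
    "dutch": {"de", "in", "en", "van", "die"},
}


def adjusted_weights(profiles, initial_weight=1.0):
    """Scale each token's initial weight by 1 / (number of languages containing it)."""
    language_count = Counter()
    for tokens in profiles.values():
        language_count.update(set(tokens))
    return {tok: initial_weight / n for tok, n in language_count.items()}


def identify_language(text, profiles):
    """Score each language by summing the adjusted weights of matching tokens."""
    weights = adjusted_weights(profiles)
    scores = {lang: 0.0 for lang in profiles}
    for token in text.lower().split():  # assumed tokenizer: whitespace split
        for lang, tokens in profiles.items():
            if token in tokens:
                scores[lang] += weights[token]
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    best, scores = identify_language("the cat sat in the garden", TOY_PROFILES)
    print(best, scores)
```

In this toy data, "in" occurs in all three profiles, so it keeps only one third of its initial weight, while "the" occurs in one profile and keeps its full weight. The shared word still contributes to every language's score, which matches the abstract's point that its importance is reduced rather than eliminated.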
