
System for Chinese Tokenization and Named Entity Recognition


Abstract

A system (100, 200) for tokenization and named entity recognition of an ideographic language is disclosed. In the system, a word lattice is generated for a string of ideographic characters using finite state grammars (150) and a system lexicon (240). Segmented text is generated by determining word boundaries in the string of ideographic characters using the word lattice, dependent upon a contextual language model (152A) and one or more entity language models (152B). One or more named entities are recognized in the string of ideographic characters using the word lattice, dependent upon the contextual language model (152A) and the one or more entity language models (152B). The contextual language model (152A) and the one or more entity language models (152B) are each class-based language models. The lexicon (240) includes single ideographic characters, words, and predetermined features of the characters and words.
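
The lattice-then-search idea in the abstract can be illustrated with a minimal sketch. It is not the patented method: it assumes a toy lexicon with made-up unigram probabilities in place of the class-based contextual and entity language models, and it omits the finite state grammars entirely. The names LEXICON, UNKNOWN_PROB, MAX_WORD_LEN, build_lattice, best_path, and segment are all hypothetical.

```python
import math

# Hypothetical toy lexicon: word -> unigram probability (illustrative values only).
# Like the lexicon in the abstract, it holds both single characters and words.
LEXICON = {
    "北京": 0.02, "大学": 0.02, "学生": 0.02, "北京大学": 0.01,
    "大": 0.005, "学": 0.005, "生": 0.005, "北": 0.002, "京": 0.002,
}
UNKNOWN_PROB = 1e-8   # assumed fallback probability for unlisted single characters
MAX_WORD_LEN = 4      # assumed longest lexicon entry, in characters


def build_lattice(text):
    """Return lattice edges (start, end, word, log_prob) for every lexicon match.

    Unlisted single characters get a low-probability edge so that every
    position in the string stays reachable.
    """
    edges = []
    for i in range(len(text)):
        for j in range(i + 1, min(len(text), i + MAX_WORD_LEN) + 1):
            word = text[i:j]
            if word in LEXICON:
                edges.append((i, j, word, math.log(LEXICON[word])))
            elif j == i + 1:
                edges.append((i, j, word, math.log(UNKNOWN_PROB)))
    return edges


def best_path(text, edges):
    """Viterbi-style search for the highest-scoring path through the lattice."""
    n = len(text)
    # best[k] = (score of best path ending at position k, (previous position, word))
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    # Sorting by start position guarantees best[start] is final before it is used.
    for start, end, word, log_prob in sorted(edges):
        score = best[start][0] + log_prob
        if score > best[end][0]:
            best[end] = (score, (start, word))
    # Walk the back-pointers from the end of the string to recover the words.
    words = []
    pos = n
    while pos > 0:
        pos, word = best[pos][1]
        words.append(word)
    return words[::-1]


def segment(text):
    """Segment text into words using the toy lattice and unigram scores."""
    return best_path(text, build_lattice(text))
```

With these toy probabilities, segment("北京大学生") returns ["北京大学", "生"], because the single long entry outscores the competing paths through shorter words. A system closer to the one described would score paths with class-based models, so that an entity class such as a person or place name could span several lattice edges.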
