Word boundary probability estimating, probabilistic language model building, kana-kanji converting, and unknown word model building

Abstract

Calculates word n-gram probabilities with high accuracy when the training corpus, a store containing vast quantities of sample sentences, consists of a first corpus, which is relatively small and contains manually segmented word information, and a second corpus, which is relatively large. The vocabulary, including contextual information, is expanded from the words occurring in the relatively small first corpus to the words occurring in the relatively large second corpus by using word n-gram probabilities estimated from an unknown word model and the raw corpus. The first corpus (word-segmented) is used for calculating n-grams and the probability that the position between two adjacent characters is a boundary between two words (the segmentation probability). The second corpus (word-unsegmented), to which probabilistic word boundaries are assigned based on information in the first corpus, is used for calculating word n-grams.
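
The two-corpus idea in the abstract can be illustrated with a minimal sketch. It is not the patented method: it assumes, purely for illustration, that the segmentation probability is estimated per pair of adjacent characters with a global fallback rate for unseen pairs, and it only computes expected unigram counts. The names `boundary_probs`, `expected_word_counts`, and `max_len` are hypothetical.

```python
from collections import Counter, defaultdict


def boundary_probs(segmented):
    """Estimate P(word boundary | left char, right char) from a segmented corpus.

    `segmented` is a list of sentences, each a list of words.
    Assumption: conditioning on the adjacent character pair is an
    illustrative choice, not something the abstract specifies.
    """
    seen = Counter()  # how often each adjacent character pair occurs
    cut = Counter()   # how often a word boundary falls between the pair
    for words in segmented:
        text = "".join(words)
        cuts, pos = set(), 0
        for w in words[:-1]:
            pos += len(w)
            cuts.add(pos)
        for i in range(1, len(text)):
            pair = (text[i - 1], text[i])
            seen[pair] += 1
            if i in cuts:
                cut[pair] += 1
    total = sum(seen.values())
    # Global boundary rate, used for character pairs never seen in training.
    fallback = sum(cut.values()) / total if total else 0.5
    probs = {pair: cut[pair] / n for pair, n in seen.items()}
    return probs, fallback


def expected_word_counts(raw, probs, fallback, max_len=4):
    """Expected word frequencies in an unsegmented corpus.

    A substring counts as a word with probability
    P(boundary before) * prod(1 - P(boundary inside)) * P(boundary after).
    """
    counts = defaultdict(float)
    for text in raw:
        n = len(text)
        # p[i] is the boundary probability just before character i;
        # sentence start and end are treated as certain boundaries.
        p = [1.0]
        p += [probs.get((text[i - 1], text[i]), fallback) for i in range(1, n)]
        p.append(1.0)
        for start in range(n):
            inside = 1.0  # probability of no boundary strictly inside the span
            for end in range(start + 1, min(n, start + max_len) + 1):
                counts[text[start:end]] += p[start] * inside * p[end]
                inside *= 1.0 - p[end]
    return counts


probs, fallback = boundary_probs([["ab", "c"], ["ab", "d"]])
counts = expected_word_counts(["abc", "xy"], probs, fallback)
```

Because the raw corpus contributes fractional counts, a string such as `xy` that never appears in the segmented corpus still receives a nonzero expected frequency, which is how the vocabulary can grow beyond the small corpus. Extending the same weighting to pairs of adjacent candidate words would give expected bigram counts.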

Bibliographic data

  • Publication No.: US2008228463A1

    Patent type:

  • Publication date: 2008-09-18

    Original format: PDF

  • Applicant/Assignee: SHINSUKE MORI; DAISUKE TAKUMA

    Application No.: US20080126980

  • Inventors: SHINSUKE MORI; DAISUKE TAKUMA

    Filing date: 2008-05-26

  • Classification: G06F17/28; G06F17/21

  • Country: US

  • Date indexed: 2022-08-21 20:15:21
