
Identifying cultural background from text


Abstract

The diaculture of a text can be determined or analyzed by tokenizing the words of the text according to a rule set, which produces tokenized text. The rule set defines two sets of grammatical types of words: a first set, whose words are replaced with tokens that indicate the grammatical type of each respective word, and a second set, whose words are passed through unchanged as tokens. N-grams are then constructed from the tokenized text, each n-gram consisting of one or more consecutive tokens. The n-grams are compared to a training data set that corresponds to a known diaculture, yielding a comparison result that indicates how well the text matches the training data set for that diaculture.
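The three steps in the abstract (rule-based tokenization, n-gram construction, comparison against training data) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the patented method: the toy `LEXICON`, the choice of which grammatical types go into `REPLACED_TYPES` and `PASSED_TYPES`, the `<UNK>` token for unknown words, and the overlap-fraction score in `match_score` are all hypothetical. A real system would presumably use a part-of-speech tagger and a much larger training corpus.

```python
# Hypothetical toy lexicon mapping words to grammatical types.
# Assumption: a real system would use a part-of-speech tagger instead.
LEXICON = {
    "i": "PRON", "you": "PRON", "we": "PRON",
    "the": "DET", "a": "DET",
    "to": "PREP", "of": "PREP", "on": "PREP",
    "reckon": "VERB", "think": "VERB", "fancy": "VERB", "want": "VERB",
    "tea": "NOUN", "coffee": "NOUN", "cuppa": "NOUN",
    "lovely": "ADJ", "great": "ADJ",
}

# Assumed rule set: the source does not say which types belong to which set.
REPLACED_TYPES = {"NOUN", "VERB", "ADJ"}   # first set: word -> type token
PASSED_TYPES = {"PRON", "DET", "PREP"}     # second set: word kept as-is


def tokenize(text):
    """Apply the rule set: replace some words by their type, pass others."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,!?;:\"'")
        if not word:
            continue
        kind = LEXICON.get(word)
        if kind in REPLACED_TYPES:
            tokens.append(f"<{kind}>")
        elif kind in PASSED_TYPES:
            tokens.append(word)
        else:
            tokens.append("<UNK>")  # assumption: unknown words get one token
    return tokens


def ngrams(tokens, max_n=3):
    """Return every run of 1..max_n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def match_score(text, training_text, max_n=3):
    """Fraction of the text's n-grams that also occur in the training text."""
    known = set(ngrams(tokenize(training_text), max_n))
    grams = ngrams(tokenize(text), max_n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in known) / len(grams)


if __name__ == "__main__":
    training = "I fancy a lovely cuppa"   # stands in for one known diaculture
    print(tokenize(training))
    print(match_score("We want the tea", training))
```

Because content words are collapsed to type tokens while function words survive verbatim, the score in this sketch reflects phrasing patterns more than topic. Whether that matches the patent's intent is an assumption on top of the abstract.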
