METHOD AND SYSTEM FOR CREATING A DOMAIN-SPECIFIC TRAINING CORPUS FROM GENERIC DOMAIN CORPORA

Abstract

A method (100) for generating a domain-specific training set, comprising: generating (130) a generic corpus comprising a plurality of tokenized documents, comprising: (i) parsing (132) a document retrieved from the generic corpus; (ii) preprocessing (134) the parsed document; (iii) tokenizing (136) the preprocessed document; and (iv) storing (138) the tokenized document in the generic corpus; generating (140) an ontology database of tokenized entries, comprising: (i) parsing (142) an ontology entry retrieved from an ontology; (ii) preprocessing (144) the parsed entry; (iii) tokenizing (146) the preprocessed entry; and (iv) storing (148) the tokenized entry in the ontology database; querying (150), using domain-specific tokenized entries from the ontology database, the tokenized documents in the generic corpus; identifying (160), based on the query, a plurality of tokenized documents specific to the domain; and storing (170), in a training set database, the identified tokenized documents as a training set specific to the domain.
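
The steps in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the abstract does not say how parsing, preprocessing, tokenizing, or querying are carried out, so the tag stripping, lowercasing, whitespace tokenization, and exact phrase matching below are all assumptions. Every function name, and the in-memory lists standing in for the generic corpus, the ontology database, and the training set database, is hypothetical.

```python
import re


def parse(raw: str) -> str:
    """Steps 132/142 (assumed): strip markup-like tags from the raw text."""
    return re.sub(r"<[^>]+>", " ", raw)


def preprocess(text: str) -> str:
    """Steps 134/144 (assumed): lowercase and replace punctuation with spaces."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())


def tokenize(text: str) -> list[str]:
    """Steps 136/146 (assumed): split on whitespace."""
    return text.split()


def build_store(raw_items: list[str]) -> list[list[str]]:
    """Steps 138/148: keep each tokenized item in a list that stands in
    for the generic corpus or the ontology database."""
    return [tokenize(preprocess(parse(raw))) for raw in raw_items]


def contains_phrase(doc_tokens: list[str], entry_tokens: list[str]) -> bool:
    """Step 150 (assumed): an entry matches if its tokens occur
    contiguously, in order, in the document."""
    n = len(entry_tokens)
    if n == 0:
        return False
    return any(
        doc_tokens[i:i + n] == entry_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


def select_training_set(
    generic_corpus: list[list[str]], ontology_db: list[list[str]]
) -> list[list[str]]:
    """Steps 160/170: keep every document matched by at least one entry."""
    return [
        doc
        for doc in generic_corpus
        if any(contains_phrase(doc, entry) for entry in ontology_db)
    ]


# Made-up example data: two medical sentences and one unrelated sentence.
raw_documents = [
    "<p>The patient was diagnosed with Myocardial Infarction.</p>",
    "<p>The stock market closed higher today.</p>",
    "Atrial fibrillation increases stroke risk.",
]
raw_ontology = ["Myocardial infarction", "Atrial fibrillation"]

generic_corpus = build_store(raw_documents)
ontology_db = build_store(raw_ontology)
training_set = select_training_set(generic_corpus, ontology_db)
```

Because documents and ontology entries pass through the same parse, preprocess, and tokenize functions, an entry such as "Myocardial infarction" can match a document regardless of case or surrounding punctuation. A real system would likely replace the linear scan with an inverted index or a search engine.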
