METHOD AND SYSTEM FOR CREATING A DOMAIN-SPECIFIC TRAINING CORPUS FROM GENERIC DOMAIN CORPORA

Abstract

A method (100) for generating a domain-specific training set, comprising: generating (130) a generic corpus comprising a plurality of tokenized documents, comprising: (i) parsing (132) a document retrieved from the generic corpus; (ii) preprocessing (134) the parsed document; (iii) tokenizing (136) the preprocessed document; and (iv) storing (138) the tokenized document in the generic corpus; generating (140) an ontology database of tokenized entries, comprising: (i) parsing (142) an ontology entry retrieved from an ontology; (ii) preprocessing (144) the parsed entry; (iii) tokenizing (146) the preprocessed entry; and (iv) storing (148) the tokenized entry in the ontology database; querying (150), using domain-specific tokenized entries from the ontology database, the tokenized documents in the generic corpus; identifying (160), based on the query, a plurality of tokenized documents specific to the domain; and storing (170), in a training set database, the identified tokenized documents as a training set specific to the domain.
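
The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the tag-stripping "parse" step, the lowercase normalization, the exact phrase-match query, and the `min_hits` threshold are all assumptions chosen for illustration, and in-memory dictionaries stand in for the corpus, ontology, and training set databases.

```python
import re


def parse(raw: str) -> str:
    """Steps 132/142 (assumed): strip HTML-like tags from a raw record."""
    return re.sub(r"<[^>]+>", " ", raw)


def preprocess(text: str) -> str:
    """Steps 134/144 (assumed): lowercase and replace non-alphanumerics with spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def tokenize(raw: str) -> list[str]:
    """Steps 136/146: parse, preprocess, then split on whitespace."""
    return preprocess(parse(raw)).split()


def build_generic_corpus(raw_documents: dict[str, str]) -> dict[str, list[str]]:
    """Steps 130/138: store each tokenized document under its id."""
    return {doc_id: tokenize(raw) for doc_id, raw in raw_documents.items()}


def build_ontology_db(raw_entries: list[str]) -> list[list[str]]:
    """Steps 140/148: store tokenized ontology entries, skipping empty ones."""
    tokenized = (tokenize(entry) for entry in raw_entries)
    return [tokens for tokens in tokenized if tokens]


def contains_phrase(tokens: list[str], phrase: list[str]) -> bool:
    """True if `phrase` occurs as a contiguous token sequence in `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def select_domain_documents(
    corpus: dict[str, list[str]],
    ontology_db: list[list[str]],
    min_hits: int = 1,  # hypothetical threshold, not from the source
) -> dict[str, list[str]]:
    """Steps 150-170: query the corpus with ontology entries and keep matches."""
    training_set = {}
    for doc_id, tokens in corpus.items():
        hits = sum(contains_phrase(tokens, entry) for entry in ontology_db)
        if hits >= min_hits:
            training_set[doc_id] = tokens
    return training_set


# Toy data, invented for this example only.
raw_documents = {
    "d1": "<p>Chest X-ray shows pulmonary edema.</p>",
    "d2": "<p>The stock market fell sharply today.</p>",
    "d3": "<p>Patient denies chest pain; no edema noted.</p>",
}
ontology_entries = ["Pulmonary edema", "Chest pain"]

corpus = build_generic_corpus(raw_documents)
ontology_db = build_ontology_db(ontology_entries)
training_set = select_domain_documents(corpus, ontology_db)
print(sorted(training_set))  # ['d1', 'd3']
```

Running the same normalization over both the documents and the ontology entries is what makes the query step a plain token comparison. A production system would more likely use an inverted index or a search engine for step 150 than the linear scan shown here.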

Bibliographic data

Publication number: WO2020109277A1

Publication date: 2020-06-04

Original format: PDF
Applicant/Patentee: KONINKLIJKE PHILIPS N.V.; TRUSTEES OF BOSTON UNIVERSITY

Application number: WO2019EP82519
Inventors: ZHU HENGHUI; TAHMASEBI MARAGHOOSH AMIR MOHAMMAD; PASCHALIDIS IOANNIS

Filing date: 2019-11-26
Classification number: G06F16/33
Country: WO
Date indexed: 2022-08-21 11:10:55