METHOD AND SYSTEM FOR CREATING A DOMAIN-SPECIFIC TRAINING CORPUS FROM GENERIC DOMAIN CORPORA
Abstract
A method (100) for generating a domain-specific training set, comprising: generating (130) a generic corpus comprising a plurality of tokenized documents, comprising: (i) parsing (132) a document retrieved from the generic corpus; (ii) preprocessing (134) the parsed document; (iii) tokenizing (136) the preprocessed document; and (iv) storing (138) the tokenized document in the generic corpus; generating (140) an ontology database of tokenized entries, comprising: (i) parsing (142) an ontology entry retrieved from an ontology; (ii) preprocessing (144) the parsed entry; (iii) tokenizing (146) the preprocessed entry; and (iv) storing (148) the tokenized entry in the ontology database; querying (150), using domain-specific tokenized entries from the ontology database, the tokenized documents in the generic corpus; identifying (160), based on the query, a plurality of tokenized documents specific to the domain; and storing (170), in a training set database, the identified tokenized documents as a training set specific to the domain.
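The steps recited in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names (`parse`, `preprocess`, `tokenize`, `build_store`, `query`), the tag-stripping parser, the whitespace tokenizer, the token-subset matching rule, and the sample documents and ontology entries are all assumptions chosen for illustration. In-memory lists stand in for the generic corpus, the ontology database, and the training set database.

```python
import re
from html import unescape


def parse(raw: str) -> str:
    """Parsing (132/142): here, simply strip markup-like tags (an assumption)."""
    return unescape(re.sub(r"<[^>]+>", " ", raw))


def preprocess(text: str) -> str:
    """Preprocessing (134/144): lowercase and replace punctuation with spaces."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())


def tokenize(text: str) -> list[str]:
    """Tokenizing (136/146): naive whitespace split (an assumption)."""
    return text.split()


def build_store(raw_items: list[str]) -> list[list[str]]:
    """Storing (138/148): run each item through the pipeline, keep token lists."""
    return [tokenize(preprocess(parse(item))) for item in raw_items]


def query(ontology_db: list[list[str]], corpus: list[list[str]],
          min_hits: int = 1) -> list[list[str]]:
    """Querying (150) and identifying (160): keep documents that contain every
    token of at least `min_hits` ontology entries. Word order and adjacency
    are ignored, which is a simplification."""
    matched = []
    for doc in corpus:
        doc_tokens = set(doc)
        hits = sum(1 for entry in ontology_db
                   if entry and set(entry) <= doc_tokens)
        if hits >= min_hits:
            matched.append(doc)
    return matched


# Hypothetical sample data, not taken from the patent.
raw_documents = [
    "<p>The patient presented with Atrial Fibrillation and chest pain.</p>",
    "<p>The stock market closed higher on Friday.</p>",
    "<p>Heart failure is a common cause of hospital admission.</p>",
]
raw_ontology = ["Atrial fibrillation", "Heart failure"]

generic_corpus = build_store(raw_documents)   # step 130
ontology_db = build_store(raw_ontology)       # step 140

# Step 170: the matched documents form the domain-specific training set.
training_set = query(ontology_db, generic_corpus)
```

In this sketch the two clinical documents are selected and the finance document is not. A production system would more plausibly use an inverted index or a search engine for the query step and a persistent store for each database.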