Creating a Training Data Set Based on Unlabeled Textual Data
Abstract
A system and method are disclosed for: obtaining a plurality of unlabeled textual documents; obtaining an initial concept; obtaining keywords from a knowledge source based on the initial concept; scoring the plurality of unlabeled textual documents based at least in part on the keywords; determining a categorization of the documents based on the scores; performing a first feature selection and creating a first vector space representation of each document in a first category and a second category, the first and second categories being based on the scores, and the first vector space representation serving as one or more labels for the associated unlabeled textual document; and generating a training set that includes a subset of the obtained unlabeled textual documents, the subset including documents belonging to the first category and documents belonging to the second category.
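The steps in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented method: the `KNOWLEDGE_SOURCE` dictionary, the keyword-fraction score, the `high` and `low` thresholds, the document-frequency-difference feature selection, and the raw term-count vectors are all hypothetical choices made here, since the abstract does not specify any of them.

```python
import re
from collections import Counter

# Hypothetical stand-in for a knowledge source (e.g. a thesaurus or ontology).
KNOWLEDGE_SOURCE = {
    "cardiology": ["heart", "cardiac", "artery", "arrhythmia"],
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score(tokens, keywords):
    """Assumed scoring rule: fraction of tokens that are concept keywords."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in keywords) / len(tokens)


def build_training_set(documents, concept, high=0.2, low=0.0, num_features=5):
    keywords = set(KNOWLEDGE_SOURCE[concept])
    tokenized = [tokenize(d) for d in documents]
    scores = [score(t, keywords) for t in tokenized]

    # Categorize by score: high scorers form the first category, low scorers
    # the second. Documents in between are left out of the training set.
    categories = {}
    for i, s in enumerate(scores):
        if s >= high:
            categories[i] = "first"
        elif s <= low:
            categories[i] = "second"

    # Feature selection (assumed criterion): keep the terms whose document
    # frequency differs most between the two categories.
    df = {"first": Counter(), "second": Counter()}
    for i, cat in categories.items():
        df[cat].update(set(tokenized[i]))
    vocab = set(df["first"]) | set(df["second"])
    features = sorted(
        vocab, key=lambda t: (-abs(df["first"][t] - df["second"][t]), t)
    )[:num_features]

    # Vector space representation: term counts over the selected features.
    training_set = []
    for i, cat in sorted(categories.items()):
        counts = Counter(tokenized[i])
        training_set.append({
            "text": documents[i],
            "category": cat,
            "vector": [counts[f] for f in features],
        })
    return features, training_set


documents = [
    "Cardiac arrhythmia affects the heart rhythm",
    "The artery carries blood away from the heart",
    "Stock markets fell sharply on Monday",
    "The recipe needs flour and one egg",
    "A healthy diet supports the heart and the mind",
]
features, training_set = build_training_set(documents, "cardiology")
for row in training_set:
    print(row["category"], row["vector"], row["text"])
```

With these thresholds, the last document mentions only one keyword among many words, so it falls between the two categories and is excluded. That matches the abstract's statement that the training set is a subset of the obtained documents.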