Creating a Training Data Set Based on Unlabeled Textual Data

Abstract

A system and method are disclosed for obtaining a plurality of unlabeled text documents; obtaining an initial concept; obtaining keywords from a knowledge source based on the initial concept; scoring the plurality of unlabeled documents based at least in part on the keywords; determining a categorization of the documents based on the scores; performing a first feature selection and creating a first vector space representation of each document in a first category and a second category, the first and second categories based on the scores, the first vector space representation serving as one or more labels for an associated unlabeled textual document; and generating the training set including a subset of the obtained unlabeled textual documents, the subset including documents belonging to the first category and documents belonging to the second category.
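The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `KNOWLEDGE_SOURCE` dictionary, the keyword-fraction score, the `high` and `low` thresholds, the document-frequency feature selection, and all function names are assumptions made for illustration. The abstract also says the vector space representation serves as the labels; the sketch takes the simpler route of returning a score-derived category next to each vector.

```python
import re
from collections import Counter

# Hypothetical stand-in for the "knowledge source": concept -> related keywords.
KNOWLEDGE_SOURCE = {"pets": {"cat", "dog", "pet"}}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_score(tokens, keywords):
    """Fraction of tokens that are keywords (0.0 for an empty document)."""
    if not tokens:
        return 0.0
    return sum(t in keywords for t in tokens) / len(tokens)


def build_training_set(docs, concept, high=0.2, low=0.0, vocab_size=5):
    """Return (vocab, X, y) built from unlabeled documents.

    Documents scoring >= high go to the first category (label 1),
    documents scoring <= low go to the second category (label 0),
    and everything in between is left out of the training set.
    """
    keywords = KNOWLEDGE_SOURCE[concept]

    # Score and categorize each document.
    selected = []
    for doc in docs:
        tokens = tokenize(doc)
        score = keyword_score(tokens, keywords)
        if score >= high:
            selected.append((tokens, 1))
        elif score <= low:
            selected.append((tokens, 0))

    # Assumed feature selection: keep the terms that occur in the most
    # selected documents, breaking ties alphabetically.
    doc_freq = Counter()
    for tokens, _ in selected:
        doc_freq.update(set(tokens))
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [term for term, _ in ranked[:vocab_size]]

    # Vector space representation: raw term counts over the selected vocabulary.
    X = [[tokens.count(term) for term in vocab] for tokens, _ in selected]
    y = [label for _, label in selected]
    return vocab, X, y
```

Calling `build_training_set(docs, "pets")` on a list of strings returns the selected vocabulary, one count vector per retained document, and the matching category labels. Documents whose score falls between the two thresholds are dropped, which is one way to keep only confidently categorized examples.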

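Bibliographic data

  • Publication number: US2017060993A1

    Patent type

  • Publication date: 2017-03-02

    Original format: PDF

  • Applicant/patentee: SKYTREE INC.

    Application number: US201615253249

  • Inventors: NICK PENDAR; ZHUANG WANG

    Filing date: 2016-08-31

  • Classification: G06F17/30; G06N99

  • Country: US

  • Date indexed: 2022-08-21 13:48:16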
