Creating Taxonomies And Training Data For Document Categorization

Abstract

Methods, apparatus, and systems are provided to generate, from a set of training documents, a set of training data and a set of features for a taxonomy of categories. In the generated taxonomy, the degree of feature overlap among categories is minimized in order to optimize use with a machine-based categorizer. The categories still make sense to a human, however, because a human makes the decisions regarding category definitions. In an example embodiment, three steps are performed for each category: a plurality of training documents is selected using Web search engines; those documents are winnowed to produce a more refined set of training documents; and a set of features that is highly differentiating for that category within a set of categories (a supercategory) is extracted. The resulting set of training documents or differentiating features is used as input to a categorizer, which determines, for a plurality of test documents, the categories to which they best belong.
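
The pipeline the abstract outlines (winnow the training documents, extract features that differentiate each category from its siblings, then categorize test documents) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the function names `tokenize`, `winnow`, `differentiating_features`, and `categorize`, the shared-token winnowing rule, and the document-frequency-difference score are all hypothetical stand-ins. The Web search step is omitted; the sketch starts from documents that have already been collected per category.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a document and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def winnow(docs, min_shared=1):
    """Hypothetical winnowing rule: keep a document only if it shares at
    least `min_shared` tokens with some other document in its category."""
    token_sets = [set(tokenize(d)) for d in docs]
    kept = []
    for i, doc in enumerate(docs):
        best = max(
            (len(token_sets[i] & token_sets[j])
             for j in range(len(docs)) if j != i),
            default=0,
        )
        if best >= min_shared:
            kept.append(doc)
    return kept


def differentiating_features(docs_by_category, top_k=2):
    """Hypothetical score: a term's document frequency in one category
    minus its highest document frequency in any sibling category.
    Terms shared across categories score near zero and are dropped,
    which keeps feature overlap between categories low."""
    doc_freq = {}
    for cat, docs in docs_by_category.items():
        counts = Counter()
        for doc in docs:
            counts.update(set(tokenize(doc)))
        doc_freq[cat] = {t: c / len(docs) for t, c in counts.items()}

    features = {}
    for cat, freqs in doc_freq.items():
        scores = {}
        for term, freq in freqs.items():
            sibling = max(
                (doc_freq[o].get(term, 0.0) for o in doc_freq if o != cat),
                default=0.0,
            )
            scores[term] = freq - sibling
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        features[cat] = [t for t in ranked[:top_k] if scores[t] > 0]
    return features


def categorize(text, features):
    """Assign a test document to the category whose feature set it
    matches most often; return None if no feature matches at all."""
    tokens = set(tokenize(text))
    scores = {cat: len(tokens & set(f)) for cat, f in features.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Toy supercategory "pets" with two categories (made-up documents).
raw = {
    "cats": ["cats purr and meow", "a cat will purr when happy",
             "kittens meow and purr", "stock prices fell today"],
    "dogs": ["dogs bark and fetch", "a dog will bark when happy",
             "puppies bark and fetch"],
}
training = {cat: winnow(docs) for cat, docs in raw.items()}
features = differentiating_features(training)
label = categorize("my pet likes to bark and fetch", features)
```

In this toy run the off-topic document is winnowed out of the "cats" set, and words that both categories use equally often, such as "and" or "happy", receive a score of zero and never become features. A production categorizer would more likely use a statistical feature-selection measure and a trained classifier; the frequency difference here is only the simplest rule that shows the idea of low overlap between categories.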

Bibliographic details

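  • Publication number: US2007185901A1
  • Publication date: 2007-08-09
  • Original format: PDF
  • Applicant/assignee: STEPHEN C. GATES
  • Application number: US20070734528
  • Inventor: STEPHEN C. GATES
  • Filing date: 2007-04-12
  • Classification: G06F7/00
  • Country: US
  • Date indexed: 2022-08-21 21:02:55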
