Creating taxonomies and training data for document categorization
Abstract
Methods, apparatus, and systems are described for generating, from a set of training documents, a set of training data and a set of features for a taxonomy of categories. In the generated taxonomy, the degree of feature overlap among categories is minimized in order to optimize use with a machine-based categorizer. The categories nevertheless make sense to a human, because a human makes the decisions regarding category definitions. In an example embodiment, for each category, a plurality of training documents, selected using Web search engines, is generated; the documents are winnowed to produce a more refined set of training documents; and a set of features that is highly differentiating for that category within a set of categories (a supercategory) is extracted. This set of training documents or differentiating features is used as input to a categorizer, which determines, for a plurality of test documents, the categories to which they best belong.
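The feature-extraction and categorization steps can be illustrated with a minimal Python sketch. The abstract does not say how "highly differentiating" is measured, so the scoring rule here is an assumption made for illustration: a term's document frequency in the category minus its highest document frequency in any sibling category. The names `differentiating_features`, `categorize`, and `top_k`, and the toy documents, are hypothetical. The search-engine selection and winnowing steps are not shown.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def doc_frequency(docs):
    """Fraction of documents in `docs` that contain each term."""
    counts = Counter()
    for doc in docs:
        counts.update(set(tokenize(doc)))
    return {term: n / len(docs) for term, n in counts.items()}


def differentiating_features(training, top_k=3):
    """Pick, per category, up to `top_k` terms that are frequent in that
    category and rare in its siblings within the same supercategory.

    Assumed scoring rule (not taken from the patent):
    score = frequency in category - highest frequency in any sibling.
    Only terms with a positive score are kept, so no term can be selected
    for two categories, which keeps feature overlap at zero.
    """
    freqs = {cat: doc_frequency(docs) for cat, docs in training.items()}
    features = {}
    for cat, freq in freqs.items():
        scores = {}
        for term, f in freq.items():
            rival = max(
                (freqs[other].get(term, 0.0) for other in freqs if other != cat),
                default=0.0,
            )
            scores[term] = f - rival
        # Sort by descending score, then alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        features[cat] = {t for t in ranked[:top_k] if scores[t] > 0}
    return features


def categorize(doc, features):
    """Assign `doc` to the category whose feature set it overlaps most."""
    tokens = set(tokenize(doc))
    return max(sorted(features), key=lambda cat: len(tokens & features[cat]))


# Toy supercategory with two categories (hypothetical training documents).
training = {
    "cats": ["the cat purrs and meows", "a cat meows at the door"],
    "dogs": ["the dog barks and growls", "a dog barks at the door"],
}
features = differentiating_features(training)
label = categorize("my dog barks loudly", features)
```

In this toy run, terms shared by both categories, such as "the" and "door", score zero and are dropped, so each category keeps only terms its sibling lacks. A production categorizer would likely use a weighted or probabilistic model rather than a simple overlap count.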