
Feature reweighting in text classifier generation using unlabeled data


Abstract

A mechanism is provided for augmenting text classifier training by incorporating unlabeled data into the generation of a text classifier. For each term of a plurality of terms in each document of a plurality of documents in a set of unlabeled data, a term frequency value is determined. The term frequency value is normalized by dividing it by the total number of terms in the document. An inverse document frequency (idf) value is determined for each term based on the term frequency value. A subset of terms is filtered from the plurality of terms based on the determined idf values. The idf values for the remaining terms are transformed into feature weights. Terms from a set of labeled data are re-weighted based on the feature weights determined from the set of unlabeled data. The text classifier is then generated using the re-weighted labeled data.
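
The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the abstract does not give the idf formula, the filtering rule, or the idf-to-weight transform, so the sketch assumes a common smoothed idf, a hypothetical `min_idf` cutoff that drops very common terms, and scaling by the largest remaining idf. All function names are illustrative, and labeled-data terms that have no feature weight are simply dropped, which is also an assumption.

```python
import math
from collections import Counter


def normalized_tf(tokens):
    """Term frequency of each term divided by the document's total term count."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def idf_values(unlabeled_docs):
    """Per-term idf over the unlabeled documents.

    Assumed formula (a common smoothed variant, not taken from the abstract):
    idf(t) = ln((1 + N) / (1 + df(t))) + 1, where df(t) counts the documents
    in which the normalized term frequency of t is non-zero.
    """
    n_docs = len(unlabeled_docs)
    doc_freq = Counter()
    for tokens in unlabeled_docs:
        tf = normalized_tf(tokens)
        doc_freq.update(term for term, value in tf.items() if value > 0)
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def feature_weights(idf, min_idf):
    """Filter out terms whose idf is below min_idf (assumed rule), then
    scale the remaining idf values into (0, 1] (assumed transform)."""
    kept = {t: v for t, v in idf.items() if v >= min_idf}
    if not kept:
        return {}
    top = max(kept.values())
    return {t: v / top for t, v in kept.items()}


def reweight(labeled_tokens, weights):
    """Re-weight one labeled document: normalized tf times the feature weight.
    Terms without a weight are dropped (an assumption of this sketch)."""
    tf = normalized_tf(labeled_tokens)
    return {t: value * weights[t] for t, value in tf.items() if t in weights}


# Toy data, invented for illustration only.
unlabeled = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]
weights = feature_weights(idf_values(unlabeled), min_idf=1.1)
vec = reweight(["the", "cat", "cat", "sat"], weights)
```

With this toy data, "the" appears in every unlabeled document, so it falls below the cutoff and is removed from both `weights` and the re-weighted labeled vector. The resulting vectors, paired with their labels, would then be passed to whatever classifier training procedure is used; the abstract does not name one.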
