AUTOMATIC EXTRACTION OF A TRAINING CORPUS FOR A DATA CLASSIFIER BASED ON MACHINE LEARNING ALGORITHMS

Abstract

An iterative classifier for unsegmented electronic documents is based on machine learning algorithms. The textual strings in the electronic document are segmented using a composite dictionary that combines a conventional dictionary and an adaptive dictionary developed based on the context and nature of the electronic document. The classifier is built using a corpus of training and testing samples automatically extracted from the electronic document by detecting signatures for a set of pre-established classes for the textual strings. The classifier is further iteratively improved by automatically expanding the corpus of training and testing samples in real time as textual strings in new electronic documents are processed and classified.
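The abstract does not disclose how segmentation or signature detection is carried out, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes greedy forward maximum matching for segmentation and regular expressions for class signatures; the dictionaries, the `#NNNN` code format, the class names, and the helper names `segment` and `extract_corpus` are all hypothetical and do not come from the patent.

```python
import re

# Hypothetical dictionaries: a general-purpose one plus a document-specific one.
CONVENTIONAL = {"steel", "pipe", "frozen", "fish"}
ADAPTIVE = {"kg", "pcs"}
COMPOSITE = CONVENTIONAL | ADAPTIVE

# Hypothetical class signatures: a code embedded in the string marks its class.
SIGNATURES = {
    "seafood": re.compile(r"#03\d{2}"),
    "metal": re.compile(r"#73\d{2}"),
}


def segment(text, dictionary, max_len=8):
    """Split an unsegmented string by greedy forward maximum matching.

    At each position the longest dictionary entry is taken; if none
    matches, a single character is emitted so the loop always advances.
    """
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


def extract_corpus(strings):
    """Keep only strings that carry a class signature and label them with it."""
    corpus = []
    for s in strings:
        for label, pattern in SIGNATURES.items():
            match = pattern.search(s)
            if match:
                # Strip the signature so the label does not leak into the features.
                text = s.replace(match.group(0), "")
                corpus.append((segment(text, COMPOSITE), label))
                break
    return corpus


samples = extract_corpus(["frozenfish#0303", "steelpipe#7304", "unknownitem"])
```

Strings without a signature, such as `"unknownitem"` here, are left out of the corpus; they are the ones a trained classifier would later have to label. In an iterative setup, newly processed strings that do carry a signature could be passed through `extract_corpus` again and appended to the existing samples before retraining.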