Journal: Expert Systems with Applications

A multi-class SVM classification system based on learning methods from indistinguishable Chinese official documents


Abstract

Support Vector Machines (SVMs) are applied to Chinese official document classification under a One-against-All (OAA) multi-class scheme. Preprocessing uses several data retrieval techniques, including sentence segmentation, term weighting, and feature extraction. We observe that most documents whose contents are indistinguishable yield poor classification results. The traditional solution is to add misclassified documents to the training set in order to adjust the classification rules. In this paper, indistinguishable documents are found to be informative for strengthening prediction performance, since the current model predicts their labels with low confidence. A general approach is proposed that uses SVM decision values to identify indistinguishable documents. Based on verified classification results and the distinguishability of documents, four learning strategies that select certain documents for the training set are proposed to improve classification performance. Experiments show that indistinguishable documents can be identified with high probability and are informative for the learning strategies. Furthermore, LMID, which adds both misclassified and indistinguishable documents to the training set, is the most effective learning strategy for SVM classification of large sets of Chinese official documents in terms of computing efficiency and classification accuracy.
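The selection idea behind LMID can be sketched in a few lines of Python. The abstract does not state how decision values are turned into an "indistinguishable" flag, so this sketch assumes one plausible rule: a document is indistinguishable when the gap between its two largest OAA decision values falls below a threshold. The function names, the `margin` parameter, and the toy scores are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def predict_label(decision_values: Sequence[float]) -> int:
    """Return the index of the class with the largest OAA decision value."""
    return max(range(len(decision_values)), key=lambda k: decision_values[k])


def is_indistinguishable(decision_values: Sequence[float], margin: float = 0.5) -> bool:
    """Flag a document whose two best decision values are closer than `margin`.

    Assumption: a small gap between the top two one-against-all scores stands
    in for "low confidence"; the paper's actual criterion may differ.
    """
    top, runner_up = sorted(decision_values, reverse=True)[:2]
    return (top - runner_up) < margin


def select_for_retraining(
    scores: Sequence[Sequence[float]],
    true_labels: Sequence[int],
    margin: float = 0.5,
) -> List[int]:
    """LMID-style selection: indices of misclassified or indistinguishable documents."""
    selected = []
    for i, (values, label) in enumerate(zip(scores, true_labels)):
        misclassified = predict_label(values) != label
        if misclassified or is_indistinguishable(values, margin):
            selected.append(i)
    return selected


# Toy decision values for four documents and three classes (made-up numbers).
scores = [
    [1.8, -1.2, -0.9],   # confident and correct
    [0.2, 0.1, -1.0],    # correct, but the top two scores are nearly tied
    [-0.7, 1.5, -1.1],   # confident, but the verified label disagrees
    [-1.0, -0.4, 1.3],   # confident and correct
]
true_labels = [0, 0, 2, 2]  # verified labels, as in the abstract's "verified results"
selected = select_for_retraining(scores, true_labels)
print(selected)
```

In this toy run the second document is selected for its low-confidence prediction and the third for being misclassified. In practice the decision values would come from the trained OAA classifiers, and the selected documents would be added to the training set before retraining.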