Journal: Knowledge-Based Systems

A new hybrid semi-supervised algorithm for text classification with class-based semantics



Abstract

Vector Space Models (VSMs) are commonly used in language processing to represent certain aspects of natural language semantics. The semantics of a VSM come from the distributional hypothesis, which states that words that occur in similar contexts usually have similar meanings. In our previous work, we proposed novel semantic smoothing kernels based on class-specific transformations. These kernels use class-term matrices, which can be considered a new type of VSM. By using the class as the context, these methods can extract class-specific semantics by making use of word distributions both in documents and in different classes. In this study, we adapt two of these semantic classification approaches to build a novel, high-performance semi-supervised text classification algorithm. These approaches include a Helmholtz-principle-based calculation of term meanings in the context of classes, used for initial classification, and a semantic kernel based on supervised term weighting, used with Support Vector Machines (SVM) for the final classification model. The approach used in the first phase is especially good at learning from very small datasets, while the approach in the second phase is especially good at eliminating noise in relatively large and noisy training sets when building a classification model. Overall, as a semantic semi-supervised learning algorithm, our approach can effectively utilize an abundant source of unlabeled instances to improve classification accuracy significantly, especially when the number of labeled instances is limited. (C) 2016 Elsevier B.V. All rights reserved.
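The two-phase idea in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the term weight below is a plain relative frequency across classes rather than the Helmholtz-principle meaning measure, and the second phase simply rebuilds the class-term matrix on the enlarged training set instead of training an SVM with a semantic kernel. All function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def tokenize(text):
    # Naive whitespace tokenizer; real systems would also remove stop words, stem, etc.
    return text.lower().split()


def class_term_matrix(docs, labels):
    """Build {class: {term: weight}} using the class as the context.

    Assumption: weight = share of a term's occurrences that fall in the class.
    This is a stand-in for the paper's class-based term meaning scores.
    """
    counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        counts[label].update(tokenize(doc))
    totals = Counter()
    for class_counts in counts.values():
        totals.update(class_counts)
    return {
        label: {term: n / totals[term] for term, n in class_counts.items()}
        for label, class_counts in counts.items()
    }


def classify(doc, matrix):
    # Score each class by summing the class-specific weights of the document's terms.
    scores = {
        label: sum(weights.get(term, 0.0) for term in tokenize(doc))
        for label, weights in matrix.items()
    }
    # Sorting the labels first makes tie-breaking deterministic.
    return max(sorted(scores), key=scores.get)


def two_phase(labeled_docs, labels, unlabeled_docs):
    # Phase 1: learn from the small labeled set and pseudo-label the unlabeled documents.
    initial = class_term_matrix(labeled_docs, labels)
    pseudo = [classify(doc, initial) for doc in unlabeled_docs]
    # Phase 2: fit the final model on labeled plus pseudo-labeled documents.
    final = class_term_matrix(labeled_docs + unlabeled_docs, labels + pseudo)
    return pseudo, final


labeled = ["goal match team", "election vote party"]
labels = ["sport", "politics"]
unlabeled = [
    "team wins match with late goal",
    "party leader wins vote",
    "coach praises team striker",
]

pseudo, final = two_phase(labeled, labels, unlabeled)
prediction = classify("striker scores", final)
```

In this toy run the word "striker" never appears in the labeled documents, so only the model built after pseudo-labeling can use it as evidence for a class. That is the sense in which unlabeled instances help when labeled instances are scarce.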
