
Ternary encoding based feature extraction for binary text classification


Abstract

A novel framework for termset-based feature extraction is proposed for binary text classification. The approach is based on encoding the terms within a termset. The ternary codes '+1' and '-1' represent the class that a term supports, whereas '0' denotes no support for either class. Four encoding schemes are proposed, in which the term weights and the term occurrence probabilities in the positive and negative documents are used to define the ternary code of a given term. The ternary patterns are used to define novel features by splitting them into positive and negative codes, where each code is treated as a different feature extractor. The derived features are investigated both individually and together with the bag-of-words representation. The histograms of the resulting features are also employed to study the improvements that can be achieved by augmenting the bag-of-words representation with a small number of additional features. Experiments conducted on four benchmark datasets with different characteristics show that the proposed feature extraction framework provides significant improvements over the bag-of-words representation.
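The abstract does not spell out the four encoding schemes, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes one plausible rule: a term gets '+1' or '-1' when its occurrence probability in positive documents exceeds (or falls below) that in negative documents by a hypothetical `margin`, and '0' otherwise. The function names, the `margin` parameter, the toy corpus, and the way the split codes are turned into counts are all illustrative choices, not the authors' definitions.

```python
from collections import Counter


def occurrence_probs(docs, labels):
    """Fraction of positive / negative documents that contain each term."""
    pos_docs = [set(d) for d, y in zip(docs, labels) if y == 1]
    neg_docs = [set(d) for d, y in zip(docs, labels) if y == 0]
    pos_counts = Counter(t for d in pos_docs for t in d)
    neg_counts = Counter(t for d in neg_docs for t in d)
    vocab = set(pos_counts) | set(neg_counts)
    return {
        t: (pos_counts[t] / len(pos_docs), neg_counts[t] / len(neg_docs))
        for t in vocab
    }


def ternary_code(p_pos, p_neg, margin=0.2):
    """Assumed rule: +1 / -1 if one class is clearly favoured, else 0."""
    diff = p_pos - p_neg
    if diff > margin:
        return 1
    if diff < -margin:
        return -1
    return 0


def split_codes(termset, codes):
    """Split a ternary pattern into a positive and a negative binary pattern."""
    pos = tuple(1 if codes[t] == 1 else 0 for t in termset)
    neg = tuple(1 if codes[t] == -1 else 0 for t in termset)
    return pos, neg


def extract_features(doc, termset, codes):
    """Illustrative features: how many positively / negatively coded
    terms of the termset occur in the document."""
    present = set(doc)
    pos_mask, neg_mask = split_codes(termset, codes)
    pos_feat = sum(m for t, m in zip(termset, pos_mask) if t in present)
    neg_feat = sum(m for t, m in zip(termset, neg_mask) if t in present)
    return pos_feat, neg_feat


# Toy corpus (made up for illustration); label 1 = positive, 0 = negative.
docs = [
    ["great", "plot", "fun"],
    ["great", "acting"],
    ["boring", "plot"],
    ["boring", "slow", "acting"],
]
labels = [1, 1, 0, 0]

probs = occurrence_probs(docs, labels)
codes = {t: ternary_code(p, n) for t, (p, n) in probs.items()}

termset = ("great", "plot", "boring")
pattern = tuple(codes[t] for t in termset)
pos_pattern, neg_pattern = split_codes(termset, codes)
features = extract_features(["great", "boring", "plot"], termset, codes)
```

In this toy setup, "plot" appears equally often in both classes and therefore receives the code '0', so it contributes to neither the positive nor the negative feature. The resulting pair of values could then be appended to a bag-of-words vector, in the spirit of the augmentation the abstract describes.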
