
Word embedding composition for data imbalances in sentiment and emotion classification


Abstract

Text classification often faces the problem of imbalanced training data. This is true in sentiment analysis and is particularly prominent in emotion classification, where multiple emotion categories are very likely to produce naturally skewed training data. Different sampling methods have been proposed to improve classification performance by reducing the imbalance ratio between training classes. However, data sparseness and the small-disjunct problem remain obstacles to generating new samples for minority classes when the data are skewed and limited. Methods that produce meaningful samples for smaller classes, rather than simple duplication, are essential to overcoming this problem. In this paper, we present an oversampling method based on word embedding compositionality which produces meaningful, balanced training data. We first use a large corpus to train a continuous skip-gram model, forming a word embedding model that maintains the syntactic and semantic integrity of the word features. Then, a compositional algorithm based on recursive neural tensor networks is used to construct sentence vectors from the word embedding model. Finally, we use the SMOTE algorithm as an oversampling method to generate samples for the minority classes and produce a fully balanced training set. Evaluation results on two quite different tasks show that both the feature composition method and the oversampling method are important in obtaining improved classification results. Our method effectively addresses the data imbalance issue and consequently achieves improved results for both sentiment and emotion classification.
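The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes gensim (≥ 4) for the continuous skip-gram model and imbalanced-learn for SMOTE, uses a tiny hypothetical corpus, and replaces the recursive neural tensor network (RNTN) composition step with a simple average of word embeddings purely for illustration.

```python
# Hypothetical sketch of the oversampling pipeline from the abstract.
# Assumptions: gensim >= 4 (skip-gram Word2Vec), imbalanced-learn (SMOTE).
# Sentence vectors are averaged word embeddings, a stand-in for the RNTN
# composition used in the paper.
import numpy as np
from gensim.models import Word2Vec
from imblearn.over_sampling import SMOTE

# Toy tokenised corpus with a skewed label distribution
# (label 1 = minority emotion class).
sentences = [
    ["the", "film", "was", "wonderful"],
    ["i", "really", "enjoyed", "this", "movie"],
    ["a", "dull", "and", "boring", "story"],
    ["the", "plot", "was", "predictable"],
    ["acting", "felt", "flat", "and", "lifeless"],
    ["i", "am", "furious", "about", "the", "ending"],
    ["the", "ending", "made", "me", "angry"],
]
labels = np.array([0, 0, 0, 0, 0, 1, 1])

# Step 1: train a continuous skip-gram model (sg=1) on the corpus.
w2v = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1, epochs=50)

# Step 2: compose sentence vectors from word embeddings
# (the paper uses an RNTN; averaging is used here only for illustration).
def sentence_vector(tokens):
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0)

X = np.vstack([sentence_vector(s) for s in sentences])

# Step 3: oversample the minority class with SMOTE to balance the training set.
# k_neighbors=1 because the toy minority class has only two examples.
X_bal, y_bal = SMOTE(k_neighbors=1, random_state=0).fit_resample(X, labels)
print(X_bal.shape, np.bincount(y_bal))  # equal class counts after oversampling
```

Because SMOTE interpolates between existing minority-class vectors in the embedding space, the synthetic samples are new points rather than duplicates, which is the property the paper relies on to avoid the small-disjunct problem.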


