International Conference on Behavioral, Economic, and Socio-Cultural Computing

Latent Semantic Analysis Boosted Convolutional Neural Networks for Document Classification



Abstract

Convolutional neural networks (CNNs) have been shown to be effective in document classification tasks. CNNs can be set up with many different architectures and parameter settings, which can make them difficult to implement. For many document classification tasks, data transformed with n-grams (typically unigrams, bigrams, and trigrams) and term-frequency inverse-document-frequency (TFIDF) weighting are still considered effective baseline models when used with linear classifiers such as logistic regression, especially on smaller datasets with fewer than 500K observations. A parsimonious CNN baseline model for sentiment classification should replicate the ease of use of linear methods. In this study, we introduce a Latent Semantic Analysis (LSA) based CNN model, in which natively trained LSA word vectors are used as input to parallel 1-dimensional convolutional layers (1D-CNNs). The LSA word vector model is obtained by applying singular value decomposition (SVD) to the data transformed with unigram counts and TFIDF weighting. The convolutional layers are therefore designed with window sizes best suited to LSA word vectors. This parsimonious LSA-based CNN model exceeds the accuracy of all linear classifiers using n-grams with TFIDF on all analyzed datasets, with an average improvement of 0.73% for the top-performing LSA-based CNN models. This may be because CNNs are more adept at capturing word relationships in phrases and sentences that are not necessarily present in the training corpus. Furthermore, the LSA-based CNN model also exceeds the performance of word2vec-based CNN models. Thus, the success of LSA-based CNNs may potentiate their use as a baseline in classification tasks alongside linear models. We also provide guiding principles to simplify the application of LSA-based CNNs in document classification tasks.
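The abstract outlines a two-stage pipeline: LSA word vectors are derived by applying SVD to a unigram TFIDF matrix, and sequences of these vectors are then fed into parallel 1D convolutional branches with different window sizes. Below is a minimal sketch of that pipeline, assuming scikit-learn and Keras; the 300-dimensional LSA space, 100 filters per branch, and window widths of 2-4 are illustrative assumptions rather than values reported in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from tensorflow.keras import layers, models

def build_lsa_word_vectors(corpus, dim=300):
    """Derive LSA word vectors from a unigram TFIDF matrix via truncated SVD."""
    tfidf = TfidfVectorizer(ngram_range=(1, 1))
    X = tfidf.fit_transform(corpus)          # documents x terms, TFIDF-weighted
    svd = TruncatedSVD(n_components=dim)
    svd.fit(X)
    # components_ has shape (dim, terms); transpose to get one vector per term.
    word_vectors = svd.components_.T         # terms x dim
    vocab = tfidf.vocabulary_                # term -> row index into word_vectors
    return word_vectors, vocab

def build_parallel_1d_cnn(seq_len, dim=300, windows=(2, 3, 4), n_classes=2):
    """Parallel 1D convolutions over sequences of LSA word vectors."""
    inp = layers.Input(shape=(seq_len, dim))
    branches = []
    for w in windows:
        conv = layers.Conv1D(filters=100, kernel_size=w, activation="relu")(inp)
        branches.append(layers.GlobalMaxPooling1D()(conv))
    merged = layers.Concatenate()(branches)
    out = layers.Dense(n_classes, activation="softmax")(merged)
    model = models.Model(inp, out)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

In use, each document would be mapped to a (seq_len, dim) matrix by looking up its tokens in the LSA word-vector table (padding or truncating to seq_len) before being passed to the model.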

