Journal: Information Retrieval

Using word embeddings in Twitter election classification

Abstract

Word embeddings and convolutional neural networks (CNN) have attracted extensive attention in various classification tasks for Twitter, e.g. sentiment classification. However, the effect of the configuration used to generate the word embeddings on the classification performance has not been studied in the existing literature. In this paper, using a Twitter election classification task that aims to detect election-related tweets, we investigate the impact of the background dataset used to train the embedding models, as well as the parameters of the word embedding training process, namely the context window size, the dimensionality and the number of negative samples, on the attained classification performance. By comparing the classification results of word embedding models trained on different background corpora (e.g. Wikipedia articles and Twitter microposts), we show that the background data should align with the Twitter classification dataset in both data type and time period to achieve significantly better performance than baselines such as SVM with TF-IDF. Moreover, by evaluating the results of word embedding models trained using various context window sizes and dimensionalities, we find that larger context windows and dimensionalities are preferable for improving performance, whereas the number of negative samples does not significantly affect the performance of the CNN classifiers. Our experimental results also show that choosing the correct word embedding model for use with the CNN leads to statistically significant improvements over various baselines such as random, SVM with TF-IDF and SVM with word embeddings. Finally, for out-of-vocabulary (OOV) words that are not available in the learned word embedding models, we show that a simple strategy of randomly initialising the OOV words without any prior knowledge is sufficient to attain good classification performance compared with other current OOV strategies (e.g. random initialisation using statistics of the pre-trained word embedding models).
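
As a concrete illustration of the three training parameters investigated above (context window size, dimensionality and number of negative samples), the sketch below sweeps them with gensim's Word2Vec. This is a minimal sketch under stated assumptions: the toy corpus, the parameter grid and the skip-gram setting (sg=1) are illustrative, not the paper's exact configuration.

```python
# Minimal sketch: sweep the three word2vec training parameters studied in the
# paper (context window, dimensionality, number of negative samples) with
# gensim. The toy corpus and the parameter grid are illustrative only.
from gensim.models import Word2Vec

# Stand-in for a large tokenised background corpus (e.g. Twitter microposts).
background_corpus = [
    ["vote", "early", "in", "the", "election"],
    ["watching", "the", "debate", "tonight"],
]

for window in (1, 5, 10):           # context window size
    for dim in (100, 300, 500):     # embedding dimensionality
        for neg in (5, 10):         # number of negative samples
            model = Word2Vec(
                sentences=background_corpus,
                vector_size=dim,    # called `size` in gensim < 4.0
                window=window,
                negative=neg,
                sg=1,               # skip-gram (an assumption here)
                min_count=1,        # keep all toy-corpus words
                workers=4,
            )
            model.save(f"w2v_win{window}_dim{dim}_neg{neg}.model")
```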
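For context, two of the baselines named above can be sketched with scikit-learn. This is a hedged illustration: the SVM variant (LinearSVC), the variable names and the mean-of-word-vectors representation are assumptions, not the paper's exact setup.

```python
# Minimal sketch of two baselines from the abstract: SVM with TF-IDF features,
# and SVM over tweets represented by averaged word embeddings. The SVM variant
# and the averaging scheme are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Baseline: SVM with TF-IDF features.
tfidf_svm = make_pipeline(TfidfVectorizer(), LinearSVC())
# tfidf_svm.fit(train_tweets, train_labels)  # train_tweets: list of strings

# Baseline: SVM with word embeddings (tweet = mean of its word vectors;
# words missing from the embedding model are simply skipped here).
def tweet_vector(tokens, wv, dim):
    vecs = [wv[w] for w in tokens if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```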

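Finally, the OOV strategies compared above can be sketched as follows: plain random initialisation with no prior knowledge versus random initialisation using statistics of the pre-trained vectors. The function name, the value ranges and the uniform/normal distribution choices are illustrative assumptions.

```python
# Minimal sketch of building a CNN embedding matrix under two OOV strategies:
# (a) "uniform": random initialisation with no prior knowledge, and
# (b) "stats": random initialisation using the mean/std of the known vectors.
import numpy as np

def build_embedding_matrix(vocab, wv, dim, strategy="uniform", seed=42):
    """vocab: word -> row index; wv: word -> pre-trained vector mapping."""
    rng = np.random.default_rng(seed)
    if strategy == "stats":
        known = np.stack([wv[w] for w in vocab if w in wv])
        mean, std = float(known.mean()), float(known.std())
    matrix = np.zeros((len(vocab), dim), dtype=np.float32)
    for word, idx in vocab.items():
        if word in wv:
            matrix[idx] = wv[word]                       # in-vocabulary word
        elif strategy == "stats":
            matrix[idx] = rng.normal(mean, std, dim)     # stats-based OOV init
        else:
            matrix[idx] = rng.uniform(-0.25, 0.25, dim)  # plain random OOV init
    return matrix
```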