How to evaluate sentiment classifiers for Twitter time-ordered data?

Abstract

Social media are becoming an increasingly important source of information about the public mood regarding issues such as elections, Brexit, and the stock market. In this paper we focus on sentiment classification of Twitter data. Constructing sentiment classifiers is a standard text mining task, but here we address the question of how to properly evaluate them, as there is no settled way to do so. Sentiment classes are ordered and unbalanced, and Twitter produces a stream of time-ordered data. The problem we address concerns the procedures used to obtain reliable estimates of performance measures, and whether the temporal ordering of the training and test data matters. We collected a large set of 1.5 million tweets in 13 European languages. We created 138 sentiment models and out-of-sample datasets, which are used as a gold standard for evaluations. The corresponding 138 in-sample datasets are used to empirically compare six different estimation procedures: three variants of cross-validation, and three variants of sequential validation (where the test set always follows the training set). We find no significant difference between the best cross-validation and sequential validation. However, we observe that all cross-validation variants tend to overestimate the performance, while the sequential methods tend to underestimate it. Standard cross-validation with random selection of examples is significantly worse than blocked cross-validation, and should not be used to evaluate classifiers in time-ordered data scenarios.
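The three families of estimation procedures named in the abstract differ only in how the time-ordered examples are split into training and test indices. The following is a minimal sketch of one plausible form of each, assuming the examples are already sorted by timestamp and indexed 0..n-1. The function names and the exact fold layouts are illustrative assumptions; the abstract does not specify the six variants the authors actually compared.

```python
import random


def random_cv_splits(n, k, seed=0):
    """Standard k-fold: examples are shuffled, so test items are scattered in time."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(j for f in folds[:i] + folds[i + 1:] for j in f)
        yield train, test


def blocked_cv_splits(n, k):
    """Blocked k-fold: each test fold is one contiguous block of the time-ordered data."""
    bounds = [i * n // k for i in range(k + 1)]
    for i in range(k):
        lo, hi = bounds[i], bounds[i + 1]
        test = list(range(lo, hi))
        # Training data may lie both before and after the test block.
        train = list(range(0, lo)) + list(range(hi, n))
        yield train, test


def sequential_splits(n, k):
    """Sequential validation: train on everything before a block, test on that block."""
    bounds = [i * n // (k + 1) for i in range(k + 2)]
    for i in range(1, k + 1):
        lo, hi = bounds[i], bounds[i + 1]
        # The test block always follows the training data in time.
        yield list(range(0, lo)), list(range(lo, hi))


if __name__ == "__main__":
    for name, splits in [
        ("random", random_cv_splits(20, 4)),
        ("blocked", blocked_cv_splits(20, 4)),
        ("sequential", sequential_splits(20, 4)),
    ]:
        for train, test in splits:
            print(name, "train size:", len(train), "test:", test)
```

In the random variant, tweets posted close together in time can land on both sides of the split, which is one plausible reason such estimates come out optimistic. The sequential variant never trains on data from the future of the test block, at the cost of using less training data in the early folds.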
