Conference: International Conference on Information Technology Systems and Innovation

Comparison on the Rule based Method and Statistical based Method on Emotion Classification for Indonesian Twitter Text


Abstract

In this study, we conducted experiments on emotion classification of Indonesian Twitter text. For these experiments, we built a labeled corpus of 7,622 tweets taken from 69 Twitter accounts and manually labeled by 5 native speakers. We used 6 basic emotion labels (angry, disgust, fear, joy, sad, surprise) and added a seventh label for the neutral class. We compared a rule-based method with a statistical method. For the rule-based method, we employed the existing Synesketch algorithm with two types of emotion word list: a manually written list and a translated WordNet-Affect list. For the statistical method, we employed an SVM (Support Vector Machine) with unigram features and two feature selection techniques, Information Gain and Minimum Frequency. In addition to the purely statistical method, we also used the manually built emotion word list in the SVM-based classification. For text pre-processing, we compared several steps: normalization, emotion conversion, stop word removal, number removal, and one-character token removal. The experimental results showed that the statistical method, with an accuracy of 71.740%, outperformed the rule-based method, with an accuracy of 63.172%. To improve the results further, we applied SMOTE to handle the imbalanced data and achieved the best result, an f-measure of 83.203%. In another experiment, we combined the purely statistical method with the rule-based method by adding the manually built word list to the classification features. The f-measure for this experiment reached only 81.592%.
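The pre-processing and feature steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop word list, the slang map, the sample tweets, and all function names are hypothetical, and `smote_like_sample` shows only the interpolation step at the core of SMOTE, without the nearest-neighbor search that the full algorithm uses to choose the second point.

```python
import random
import re
from collections import Counter

# Hypothetical resources, for illustration only (not from the paper).
STOP_WORDS = {"yang", "dan", "di", "ke"}
SLANG = {"gk": "tidak", "bgt": "banget"}


def preprocess(text):
    """Lowercase, normalize slang, then drop numbers, stop words,
    and one-character tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [SLANG.get(t, t) for t in tokens]
    return [
        t for t in tokens
        if not t.isdigit() and t not in STOP_WORDS and len(t) > 1
    ]


def build_vocabulary(token_lists, min_freq=2):
    """Minimum Frequency selection: keep unigrams that occur
    at least min_freq times in the whole corpus."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    return sorted(t for t, c in counts.items() if c >= min_freq)


def vectorize(tokens, vocabulary):
    """Unigram count vector over the selected vocabulary."""
    counts = Counter(tokens)
    return [float(counts[t]) for t in vocabulary]


def smote_like_sample(sample, neighbor, rng):
    """Create a synthetic point on the segment between a minority-class
    sample and one of its neighbors (the SMOTE interpolation step)."""
    gap = rng.random()
    return [x + gap * (y - x) for x, y in zip(sample, neighbor)]


# Hypothetical tweets, invented for this sketch.
tweets = [
    "Aku senang bgt hari ini",
    "Senang dan bahagia di rumah",
    "Aku gk takut 123 x",
]
token_lists = [preprocess(t) for t in tweets]
vocabulary = build_vocabulary(token_lists, min_freq=2)
vectors = [vectorize(tokens, vocabulary) for tokens in token_lists]
synthetic = smote_like_sample(vectors[0], vectors[1], random.Random(0))
```

The resulting count vectors, together with any synthetic minority-class samples, would then be passed to an SVM classifier. In practice a library implementation of SMOTE and of the SVM would be the more reasonable choice than hand-written code.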
