Home > Foreign-language conferences > India Conference > Sentiment Analysis on 4GE WENHAO (Bengali Horoscope) Corpus

Sentiment Analysis on 4GE WENHAO (Bengali Horoscope) Corpus

Abstract

In its simplest form, sentiment analysis is the classification of a piece of text as positive or negative according to its polarity. Horoscopes consist of future predictions for each of the twelve zodiac signs and are very popular in India; all major TV channels and newspapers publish their horoscope expert's predictions daily. These daily horoscopes are well suited to sentiment analysis because a high percentage of their sentences carry strong sentiment. This work deals with sentiment analysis of Bengali daily horoscopes. A corpus of 6000 sentences is created by crawling the daily horoscope section of a leading Bengali newspaper's website. Each sentence is annotated with its polarity (positive or negative) by a team of three independent annotators. A lexicon of 58 stop words is also created from the frequently occurring words in the corpus. Five well-known classification algorithms are compared: Naïve Bayes, Support Vector Machines (SVM), k-Nearest Neighbours, Decision Tree and Random Forest. For each algorithm, three different input features (unigram, bigram and trigram presence) are tried. Stop word removal and feature selection using the information gain metric are also applied. The best combination is SVM with all unigram features, with neither stop word removal nor information-gain feature selection, which yields an accuracy of 98.7%.
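The feature steps named in the abstract (n-gram presence, stop word removal, information gain) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy English sentences and the three-word stop list below are all hypothetical stand-ins, and the classifier itself is left out. The resulting presence features could be passed to any of the five classifiers mentioned.

```python
import math
from collections import Counter


def ngram_presence(tokens, n, stop_words=frozenset()):
    """Return the set of n-grams present in a tokenized sentence.

    Presence (not frequency) is recorded, so each n-gram counts once.
    Stop words, if given, are dropped before the n-grams are formed.
    """
    kept = [t for t in tokens if t not in stop_words]
    return {tuple(kept[i:i + n]) for i in range(len(kept) - n + 1)}


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_sets, labels, feature):
    """Information gain of one binary presence feature w.r.t. the labels."""
    with_f = [y for fs, y in zip(feature_sets, labels) if feature in fs]
    without_f = [y for fs, y in zip(feature_sets, labels) if feature not in fs]
    total = len(labels)
    conditional = (
        (len(with_f) / total) * entropy(with_f)
        + (len(without_f) / total) * entropy(without_f)
    )
    return entropy(labels) - conditional


# Toy, made-up English sentences standing in for the Bengali corpus.
sentences = [
    ("you will gain wealth today".split(), "positive"),
    ("you will enjoy good health".split(), "positive"),
    ("you will face loss today".split(), "negative"),
    ("you may suffer poor health".split(), "negative"),
]
stop_words = frozenset({"you", "will", "may"})  # hypothetical stop list

# Unigram presence features with stop words removed.
features = [ngram_presence(toks, 1, stop_words) for toks, _ in sentences]
labels = [label for _, label in sentences]

# Rank the vocabulary by information gain, highest first.
vocabulary = sorted(set().union(*features))
ranked = sorted(
    vocabulary,
    key=lambda f: information_gain(features, labels, f),
    reverse=True,
)
```

In this toy data, a word such as "health" appears once in each class and therefore has zero information gain, while a word that appears in only one class scores higher. Feature selection would keep the top-ranked entries of `ranked`; the abstract reports that the best result came from skipping this step and keeping all unigrams.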
