Journal: International Journal of Web Information Systems

A set of parameters for automatically annotating a Sentiment Arabic Corpus



Abstract

Purpose - This paper proposes an approach to automatically annotating a large corpus in Arabic dialect. The corpus is used to analyse the sentiments of Arabic-speaking users on social media. The work focuses on the Algerian dialect, a sub-dialect of Maghrebi Arabic. Although Algerian is spoken by roughly 40 million people, few studies address its automated processing in general or sentiment analysis in particular.

Design/methodology/approach - The approach is based on constructing a sentiment lexicon and using it to automatically annotate a large corpus of Algerian text extracted from Facebook. This makes it possible to significantly increase the size of the training corpus without manual annotation. The annotated corpus is then vectorized using document embeddings (doc2vec), an extension of word embeddings (word2vec). For sentiment classification, the authors used several classifiers, including support vector machines (SVM), Naive Bayes (NB) and logistic regression (LR).

Findings - The results suggest that the NB and SVM classifiers generally gave the best results and MLP generally gave the worst. Further, the threshold that the authors use to select messages for the training set had a noticeable impact on recall and precision, with a threshold of 0.6 producing the best results. Using PV-DBOW led to slightly higher results than using PV-DM, and combining the PV-DBOW and PV-DM representations led to slightly lower results than using PV-DBOW alone. The best results were obtained by the NB classifier, with F1 up to 86.9 per cent.

Originality/value - The principal originality of this paper lies in determining the right parameters for automatically annotating an Algerian dialect corpus. The annotation is based on a sentiment lexicon that was itself constructed automatically.
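
The lexicon-based annotation step with a selection threshold can be illustrated with a minimal Python sketch. The abstract does not say how the threshold is computed, so the scoring rule below (the normalized difference between positive and negative lexicon hits) is an assumption, and the toy English lexicon and the names `LEXICON`, `polarity_score` and `annotate` are hypothetical. Only the threshold value of 0.6 comes from the abstract. The doc2vec vectorization and the classifiers are not shown.

```python
# Hypothetical toy lexicon: word -> polarity (+1 positive, -1 negative).
# The paper's lexicon is Algerian Arabic and built automatically; this one
# is only a stand-in for illustration.
LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "awful": -1, "hate": -1}


def polarity_score(tokens, lexicon):
    """Return (pos - neg) / (pos + neg) over lexicon hits, or None if no hits.

    This scoring rule is an assumption, not the authors' stated formula.
    """
    pos = sum(1 for t in tokens if lexicon.get(t, 0) > 0)
    neg = sum(1 for t in tokens if lexicon.get(t, 0) < 0)
    if pos + neg == 0:
        return None
    return (pos - neg) / (pos + neg)


def annotate(messages, lexicon, threshold=0.6):
    """Label messages whose absolute score reaches the threshold; drop the rest."""
    labelled = []
    for text in messages:
        tokens = text.lower().split()
        score = polarity_score(tokens, lexicon)
        if score is None or abs(score) < threshold:
            continue  # too ambiguous (or no lexicon words) to use for training
        labelled.append((text, "positive" if score > 0 else "negative"))
    return labelled


messages = [
    "great food love it",
    "good but awful service",
    "hate this bad",
    "no opinion words here",
]
training_set = annotate(messages, LEXICON)
```

Under this assumed rule, a higher threshold keeps fewer but less ambiguous messages in the training set, which is one plausible reading of why the threshold affects precision and recall.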
