A set of parameters for automatically annotating a Sentiment Arabic Corpus

Guellil Imane; Darwish Kareem; Azouaou Faical

Abstract

Purpose - This paper proposes an approach for automatically annotating a large corpus in Arabic dialect. The corpus is used to analyse the sentiments of Arabic users on social media. The paper focuses on the Algerian dialect, which is a sub-dialect of Maghrebi Arabic. Although Algerian is spoken by roughly 40 million people, few studies address its automated processing in general and sentiment analysis in particular.

Design/methodology/approach - The approach is based on the construction and use of a sentiment lexicon to automatically annotate a large corpus of Algerian text extracted from Facebook. This approach significantly increases the size of the training corpus without requiring manual annotation. The annotated corpus is then vectorized using document embedding (doc2vec), which is an extension of word embeddings (word2vec). For sentiment classification, the authors used several classifiers, such as support vector machines (SVM), Naive Bayes (NB) and logistic regression (LR).

Findings - The results suggest that the NB and SVM classifiers generally gave the best results and the multilayer perceptron (MLP) generally gave the worst. Further, the threshold used to select messages for the training set had a noticeable impact on recall and precision, with a threshold of 0.6 producing the best results. Using PV-DBOW (the distributed bag-of-words variant of doc2vec) led to slightly higher results than using PV-DM (the distributed memory variant). Combining the PV-DBOW and PV-DM representations led to slightly lower results than using PV-DBOW alone. The best results were obtained by the NB classifier, with F1 up to 86.9 per cent.

Originality/value - The principal originality of this paper lies in determining the right parameters for automatically annotating an Algerian dialect corpus. The annotation is based on a sentiment lexicon that was itself constructed automatically.
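
As a rough illustration of the lexicon-based annotation step described in the abstract, the following Python sketch labels messages with a toy sentiment lexicon and keeps only those whose score clears a threshold. The lexicon entries, the averaging score and the function names are assumptions made for illustration. The abstract does not say how the authors compute the score to which the 0.6 threshold applies, so this should not be read as their method.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: word -> polarity in [-1, 1].
# English placeholders only; these are NOT entries from the paper's lexicon.
TOY_LEXICON: Dict[str, float] = {
    "good": 1.0,
    "great": 1.0,
    "bad": -1.0,
    "awful": -1.0,
}


def message_score(message: str, lexicon: Dict[str, float]) -> float:
    """Average polarity of the lexicon words found in the message.

    Assumed scoring rule for this sketch; returns 0.0 when no word matches.
    """
    hits = [lexicon[w] for w in message.lower().split() if w in lexicon]
    return sum(hits) / len(hits) if hits else 0.0


def annotate(
    messages: List[str],
    lexicon: Dict[str, float],
    threshold: float = 0.6,
) -> List[Tuple[str, str]]:
    """Label messages and keep only those with |score| >= threshold."""
    labelled = []
    for msg in messages:
        score = message_score(msg, lexicon)
        # Messages with a zero or weak score are left out of the training set.
        if score != 0.0 and abs(score) >= threshold:
            labelled.append((msg, "positive" if score > 0 else "negative"))
    return labelled


corpus = [
    "great phone good battery",   # both matches positive: kept
    "awful service",              # single negative match: kept
    "good screen but bad sound",  # mixed polarity averages to 0: dropped
    "no opinion words here",      # no lexicon match: dropped
]
training_set = annotate(corpus, TOY_LEXICON)
```

In a pipeline like the one the abstract outlines, the retained (message, label) pairs would then be vectorized, for instance with doc2vec, and passed to a classifier. Raising the threshold trades corpus size for label reliability, which is consistent with the reported effect of the threshold on recall and precision.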

Bibliographic information

Source
International Journal of Web Information Systems | 2019, Issue 5 | pp. 594-615 | 22 pages
Authors
Guellil Imane; Darwish Kareem; Azouaou Faical
Author affiliations

Laboratoire des Methodes de Conception des Systemes, Ecole Nationale Superieure d'Informatique, Oued-Smar, Alger, Algerie;

Qatar Computing Research Institute (QCRI), Doha, Qatar;

Original format: PDF
Language: English
Keywords
Arabic sentiment analysis; Algerian dialect; Sentiment lexicon; Sentiment corpus; Doc2vec;

Similar literature

1. Automatically annotating a five-billion-word corpus of Japanese blogs for sentiment and affect analysis [J]. Michal Ptaszynski, Rafal Rzepka, Kenji Araki. Computer Speech and Language. 2014, Issue 1
2. A Morphologically Annotated Corpus and a Morphological Analyzer for Egyptian Arabic [J]. Amany Fashwan, Sameh Alansary. Procedia Computer Science. 2021, Issue a
3. A continuous vocoder for statistical parametric speech synthesis and its evaluation using an audio-visual phonetically annotated Arabic corpus [J]. Mohammed Salah Al-Radhi, Omnia Abdo, Tamas Gabor Csapo. Computer Speech and Language. 2020, Issue Mara
4. Automatically Annotating A Five-Billion-Word Corpus of Japanese Blogs for Affect and Sentiment Analysis [C]. Michal Ptaszynski, Rafal Rzepka, Kenji Araki. Workshop on Computational Approaches to Subjectivity and Sentiment Analysis. 2012
5. Annotating a corpus of biomedical research texts: Two models of rhetorical analysis [D]. White, Barbara Ellen. 2010
6. Opinion: Strategy of Semi-Automatically Annotating a Full-Text Corpus of Genomics & Informatics [O]. Hyun-Seok Park. 2018
7. Automatically annotating a five-billion-word corpus of Japanese blogs for sentiment and affect analysis [O]. Ptaszynski, Michal; Rzepka, Rafal; Araki, Kenji. 2014
