Urdu Sentiment Corpus (v1.0): Linguistic Exploration and Visualization of Labeled Dataset for Urdu Sentiment Analysis

Abstract

The value of labeled datasets is well known to artificial intelligence practitioners. Natural language processing has seen remarkable work for many languages (such as English, Chinese, and Arabic), owing to the availability of substantial data. For Urdu, despite its being the third most widely spoken language in the world, very little research has been reported; hence, it is regarded as a ‘morphologically rich’ but ‘resource-poor’ language. Further, researchers working on Urdu natural language processing are hampered by the lack of labeled/annotated datasets. This paper shares the “Urdu Sentiment Corpus” (USC), a dataset of Urdu tweets for sentiment analysis and polarity detection, along with insights drawn from it. The dataset consists of tweets with a political cast, presenting a competitive environment between two separate political parties versus the government of Pakistan. Overall, the dataset comprises over 17,185 tokens, with 52% of records labeled positive and 48% labeled negative. The paper shares visual insights (from document level to word level) into textual similarities, manifold learning, and so on. In addition, it presents a part-of-speech-wise analysis and a simple technique for extracting sentiment lexicons from the corpus.

Bibliographic details

Source
International Conference on Information Science and Communication Technology | 2020 | 1 v. | 15 pages
Authors
Muhammad Yaseen Khan; Muhammad Suffian Nizami
Original format
PDF
Chinese Library Classification
Semiconductor integrated circuits (solid-state circuits)
Keywords
Corpus; Dataset; Computational Linguistics; Linguistics; Sentiment Analysis; Sentiment Classification; Urdu; Visualization

Similar literature

1. Creating sentiment lexicon for sentiment analysis in Urdu: The case of a resource‐poor language [J]. Asghar Muhammad Zubair, Sattar Anum, Khan Aurangzeb. Expert Systems. 2019, No. 3
2. A survey on sentiment analysis in Urdu: A resource-poor language [J]. Asad Khattak, Muhammad Zubair Asghar, Anam Saeed. Egyptian Informatics Journal. 2021, No. 1
3. Urdu Sentiment Corpus (v1.0): Linguistic Exploration and Visualization of Labeled Dataset for Urdu Sentiment Analysis [C]. Muhammad Yaseen Khan, Muhammad Suffian Nizami. International Conference on Information Science and Communication Technology. 2020
4. THE SYNTAX AND SEMANTICS OF QUESTIONS IN ENGLISH, HINDI AND URDU: A STUDY IN APPLIED LINGUISTICS. [D]. SIDDIQUI, AHMAD HASAN. 1977
5. Three datasets reporting unexpected events for everyday scenarios: Over 9000 events human-labelled for overall valence/sentiment topic category and relationship to the initial goal of the scenario [O]. Molly S. Quinn, Mark T. Keane. 2021
6. Urdu Sentiment Analysis with Deep Learning Methods [O]. Lal Khan, Ammar Amjad, Noman Ashraf. 2021
