Conference: International Conference on Information Science and Communication Technology

Urdu Sentiment Corpus (v1.0): Linguistic Exploration and Visualization of Labeled Dataset for Urdu Sentiment Analysis

Abstract

The importance of labeled datasets is well known to artificial intelligence practitioners. Natural language processing has seen remarkable work for many languages (such as English, Chinese, and Arabic), owing to the availability of substantial data. For Urdu, despite its being the third most spoken language in the world, very little research has been published; hence, it is regarded as a ‘morphologically rich’ but ‘resource-poor’ language. Moreover, researchers working on Urdu natural language processing are hampered by the lack of labeled/annotated datasets. This paper shares a dataset of Urdu tweets, the “Urdu Sentiment Corpus” (USC), together with insights into it, for sentiment analysis and polarity detection. The dataset consists of tweets with a political slant, reflecting a competitive environment of two separate political parties versus the government of Pakistan. Overall, the dataset comprises over 17,185 tokens, with 52% of records labeled positive and 48% labeled negative. The paper presents visual insights (from document level to word level) into textual similarities, manifold learning, and related aspects. In addition, it presents a part-of-speech-wise analysis and a simple technique for extracting sentiment lexicons from the corpus.