Journal: Computer Speech and Language

Part-of-speech tagging for Arabic tweets using CRF and Bi-LSTM


Abstract

Over the past few years, Twitter has experienced massive growth, and the volume of its online content has increased rapidly. This content has been a rich source for natural language processing (NLP) research. However, Twitter data pose numerous challenges for NLP tasks. For English, Twitter has an NLP tool that supports tweet-specific NLP tasks, which presents significant opportunities for English NLP research and applications. Part-of-speech (POS) tagging for English tweets is one of the tasks that such a tool offers. In contrast, only a few attempts have been made to develop POS taggers for Arabic content on Twitter. In this paper, we consider POS tagging, an NLP task that directly affects the performance of subsequent text processing tasks. We introduce three manually annotated datasets for the POS tagging of Arabic tweets: the 'Mixed,' 'MSA,' and 'GLF' datasets, with 3000, 1000, and 1000 Arabic tweets, respectively. In addition, we present an exploratory analysis of how hashtags are used in Arabic tweets, a phenomenon that affects the task of POS tagging. We also present two supervised POS taggers based on two approaches: Conditional Random Fields (CRF) and Bidirectional Long Short-Term Memory (Bi-LSTM) models. We conclude that the Bi-LSTM-based POS tagger achieves state-of-the-art results on the 'Mixed' dataset, with 96.5% accuracy. By comparison, the dialect-specific taggers trained on the 'MSA' and 'GLF' datasets achieve accuracies of 95.6% and 95%, respectively. The results for the 'Mixed' dataset indicate that a joint POS tagger is effective without the need for a dialect-specific POS tagger.
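To make the setup concrete, here is a minimal sketch of the kind of per-token features a CRF tagger for tweets might consume, together with the token-level accuracy metric the abstract reports. The feature names, the `token_features` and `token_accuracy` helpers, and the example tweet are all illustrative assumptions; the abstract does not specify the paper's actual feature set or evaluation code.

```python
import re

# Assumption: a simple pattern for URLs; the paper's preprocessing is not described here.
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)


def token_features(tokens, i):
    """Hypothetical feature dict for token i (not the paper's feature set)."""
    tok = tokens[i]
    return {
        "token": tok,
        "prefix2": tok[:2],
        "suffix2": tok[-2:],
        # Tweet-specific cues: hashtags, mentions, and URLs.
        "is_hashtag": tok.startswith("#"),
        "is_mention": tok.startswith("@"),
        "is_url": bool(URL_RE.match(tok)),
        "is_digit": tok.isdigit(),
        # Neighbouring tokens give the sequence model local context.
        "prev": tokens[i - 1] if i > 0 else "<BOS>",
        "next": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
    }


def tweet_features(tokens):
    """Feature dicts for every token in one tokenized tweet."""
    return [token_features(tokens, i) for i in range(len(tokens))]


def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    total = sum(len(tags) for tags in gold)
    correct = sum(
        g == p
        for gold_tags, pred_tags in zip(gold, pred)
        for g, p in zip(gold_tags, pred_tags)
    )
    return correct / total if total else 0.0


if __name__ == "__main__":
    tweet = ["@user", "#السعودية", "جميلة", "https://t.co/x"]
    for feats in tweet_features(tweet):
        print(feats)
```

Feature dicts of this shape are what CRF toolkits typically accept, one list per sentence; a Bi-LSTM tagger would usually replace them with learned word or character embeddings.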
