IEEE International Conference on Automation and Computing

Statistical Topic Modeling for Urdu Text Articles

Abstract

Natural Language Processing (NLP) is a branch of Artificial Intelligence that helps computers manipulate and interpret human languages. Within NLP, text mining is a technique for deriving useful information from text. A Topic Model (TM) is a statistical model that extracts topics from a large collection of unlabeled text using NLP and machine learning techniques. Several effective TMs are available for languages such as English, German, and Arabic. However, no compelling TM is available for Urdu, a low-resource South Asian language. In this study, we adapt an existing TM, Latent Dirichlet Allocation (LDA), to address the challenges that Urdu poses for text mining. We analyze LDA as an unsupervised model for Urdu topic identification in two configurations: Variational Bayes (VB) based LDA for Urdu (VB-ULDA) with a stemmer and without a stemmer. Experiments are performed on a large, self-created collection of Urdu documents organized into four corpora. The experiments show that VB-ULDA outperforms the existing Urdu LDA (ULDA) in identifying topics from Urdu text documents, in terms of both accuracy and efficiency. The results also reveal the high impact of the stemming algorithm on Urdu topic identification.