Indian Journal of Science and Technology

Roman-Urdu News Headline Classification with IR Models using Machine Learning Algorithms


Abstract

Objectives: Roman-Urdu is considered a non-standard language that is used frequently on the Internet. Classifying Roman-Urdu text for article tagging is a difficult task because of the many irregularities in spelling; for example, the word khubsurat (beautiful) can also be written as khoobsurat, khubsoorat, and khobsorat. Methods/Statistical Analysis: In this study, we scrape Roman-Urdu news headlines from various online newspapers. Our corpus contains 12,319 news headlines in seven categories: Accident, Sports, Weather, Arrest, Conference, Operation, and Violence. We apply preprocessing steps such as a Roman-Urdu stop-word list and use two IR models, TF-IDF and Count Vector, for feature extraction before applying the classifier algorithms. Findings: We compare results across different machine learning algorithms: RF, LSVC, MNB, LR, RC, PAC, Perceptron, NC, and SGDC. The best result for identifying the desired class comes from the SGD classifier, which gives 93.50% accuracy. Application/Improvements: We recommend using SGD classifiers for Roman-Urdu news headline text classification.
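The abstract describes a pipeline of stop-word filtering, TF-IDF feature extraction, and a classifier. The sketch below illustrates that flow using only the Python standard library. It is not the authors' implementation: the six headlines, the stop-word list, and all function names are hypothetical and made up for illustration, and the classifier is a simplified nearest-centroid rule using cosine similarity (an assumption about what "NC" in the abstract refers to), not the SGD classifier the paper recommends.

```python
import math
from collections import Counter, defaultdict

# Hypothetical Roman-Urdu stop words; the paper's actual list is not given.
STOP_WORDS = {"mein", "ne", "ka", "ki", "ke", "aur", "se"}

# Made-up toy headlines, NOT taken from the paper's corpus.
TRAIN = [
    ("Lahore mein bus hadsa 5 afrad zakhmi", "Accident"),
    ("Karachi mein truck aur car ka hadsa", "Accident"),
    ("Pakistan ne cricket match jeet liya", "Sports"),
    ("hockey team ne final match jeet liya", "Sports"),
    ("Karachi mein tez barish ka imkan", "Weather"),
    ("Lahore mein barish aur thandi hawa", "Weather"),
]

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def fit_idf(docs):
    """Smoothed inverse document frequency for every training term."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf(tokens, idf):
    """L2-normalised sparse TF-IDF vector; unseen terms are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}

def fit_centroids(vectors, labels):
    """Average the TF-IDF vectors of each class."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for vec, label in zip(vectors, labels):
        for t, v in vec.items():
            sums[label][t] += v
    return {lab: {t: v / counts[lab] for t, v in terms.items()}
            for lab, terms in sums.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def predict(text, idf, centroids):
    """Return the class whose centroid is most similar to the headline."""
    vec = tfidf(tokenize(text), idf)
    return max(centroids, key=lambda lab: cosine(vec, centroids[lab]))

docs = [tokenize(text) for text, _ in TRAIN]
labels = [label for _, label in TRAIN]
idf = fit_idf(docs)
centroids = fit_centroids([tfidf(d, idf) for d in docs], labels)

for headline in ["Multan mein bus ka hadsa", "barish ka imkan"]:
    print(headline, "->", predict(headline, idf, centroids))
```

Note that this sketch treats each spelling as a separate term, so variants such as khubsurat and khoobsurat would receive unrelated features; that is the spelling-irregularity problem the abstract points to.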