Indian Journal of Science and Technology

Classification of Sindhi Headline News Documents based on TF-IDF Text Analysis Scheme


Abstract

Objectives: Sindhi is a historically rich Indo-Aryan language with a diverse background and many dialects. The recent drive toward globalization, e-commerce, and e-literacy has influenced languages as well. Many Sindhi magazines, books, newspapers, and web materials are available online, but no proper dataset has yet been designed for Sindhi information processing. This study presents a dataset of Sindhi news headline texts and an automated tool that classifies online texts according to predefined labels.

Methods/Statistical Analysis: To collect the dataset, a scraping tool was designed to extract headlines from the most popular newspapers, Awami Awaz and Daily Jhoongar. The dataset contains 2800 Sindhi news headlines under the following category labels: 0. Entertainment, 1. Sports, 2. Science and Technology, 3. International, 4. National, 5. Sindhi news. The dataset is normalized by removing stop words and cleaning up spaces, punctuation, and other unnecessary text. The language features are then analyzed using TF-IDF and a vector model. The paper presents a Sindhi headline news classification model implemented with the following machine learning classification algorithms: Multinomial NB, Linear SVC, Logistic Regression, MLP Classifier, SGD Classifier, Random Forest Classifier, and Ridge Classifier.

Findings: Linear SVC and the MLP Classifier perform better on Sindhi news headline categorization than the other classification techniques. This study helps improve the automatic classification of Sindhi news headlines.

Application/Improvements: It is recommended that the Linear SVC (LSVC) and MLP classifiers be used for Sindhi news headline classification.
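The normalization and TF-IDF steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state which TF-IDF variant or tooling was used, so the code below assumes the plain form tf × log(N / df). The function names `normalize` and `tfidf_vectors`, the stop-word set, and the English placeholder headlines are all hypothetical and stand in for real Sindhi data and a real Sindhi stop-word list.

```python
import math
import re
from collections import Counter

# Hypothetical placeholder stop words; a real Sindhi stop-word list
# would replace this set.
STOP_WORDS = {"the", "a", "in", "of"}


def normalize(text):
    """Lowercase, replace punctuation with spaces, split on whitespace,
    and drop stop words."""
    tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting: (term count / document length) * log(N / df),
    where df is the number of documents containing the term.
    """
    tokenized = [normalize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: count each term once per document.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Placeholder headlines, invented purely for illustration.
headlines = [
    "The team wins the cricket final",
    "New phone launched in the market",
    "Cricket team announces new captain",
]
vectors = tfidf_vectors(headlines)
```

In the first placeholder headline, "wins" appears in only one document and so receives a higher weight than "team", which appears in two. Sparse vectors of this kind are what a classifier such as Linear SVC would take as input features.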
