Journal: Procedia Computer Science

News Article Text Classification in Indonesian Language



Abstract

This research aims to find an appropriate algorithm for automatically classifying news articles in the Indonesian language. The dataset was collected by web crawling from www.cnnindonesia.com. Each document first undergoes text preprocessing in the form of lemmatization and stopword removal; this step comes before anything else in order to minimize noise in the document. Next, feature selection is applied to further separate important words from less important ones. After feature selection, the document is classified by the classifier. We compare the TF-IDF and SVD algorithms for feature selection, and the Multinomial Naïve Bayes, Multivariate Bernoulli Naïve Bayes, and Support Vector Machine classifiers. Based on the test results, the combination of TF-IDF and the Multinomial Naïve Bayes classifier gives the highest result among the tested combinations, with a precision of 0.9841519 and a recall of 0.9840000. This result outperforms a previous similar study on classifying Indonesian-language news articles, which obtained 85% accuracy.
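The best-performing combination named in the abstract (stopword removal, TF-IDF weighting, Multinomial Naïve Bayes) can be illustrated with a minimal from-scratch sketch. This is not the authors' implementation: the stopword list, the four toy training sentences, the category labels, and the helper names (`tokenize`, `fit_idf`, `tfidf`, `classify`) are all hypothetical, lemmatization is omitted, and the smoothed IDF formula and Laplace smoothing value are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stopword list; a real Indonesian list would be far longer.
STOPWORDS = {"yang", "di", "dan", "ke", "dari", "ini", "itu"}


def tokenize(text):
    """Lowercase, split on whitespace, drop stopwords (no lemmatization)."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def fit_idf(docs_tokens):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n = len(docs_tokens)
    df = Counter(term for tokens in docs_tokens for term in set(tokens))
    return {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}


def tfidf(tokens, idf):
    """TF-IDF weights for one document; unseen terms are ignored."""
    tf = Counter(t for t in tokens if t in idf)
    return {term: count * idf[term] for term, count in tf.items()}


class MultinomialNB:
    """Multinomial Naive Bayes over (possibly fractional) term weights."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, vectors, labels, vocab):
        class_counts = Counter(labels)
        self.log_prior = {
            c: math.log(k / len(labels)) for c, k in class_counts.items()
        }
        # Sum the TF-IDF weight of every term within each class.
        weight = defaultdict(lambda: defaultdict(float))
        for vec, label in zip(vectors, labels):
            for term, w in vec.items():
                weight[label][term] += w
        self.log_like = {}
        for c in class_counts:
            total = sum(weight[c].values()) + self.alpha * len(vocab)
            self.log_like[c] = {
                t: math.log((weight[c].get(t, 0.0) + self.alpha) / total)
                for t in vocab
            }
        return self

    def predict(self, vec):
        scores = {
            c: self.log_prior[c]
            + sum(w * self.log_like[c][t] for t, w in vec.items())
            for c in self.log_prior
        }
        return max(scores, key=scores.get)


# Toy training data (hypothetical; the paper's crawled dataset is not shown).
TRAIN = [
    ("timnas menang di pertandingan sepak bola", "olahraga"),
    ("pemain sepak bola cetak gol", "olahraga"),
    ("harga saham naik di bursa", "ekonomi"),
    ("bank menaikkan suku bunga", "ekonomi"),
]

train_tokens = [tokenize(text) for text, _ in TRAIN]
idf = fit_idf(train_tokens)
model = MultinomialNB().fit(
    [tfidf(tokens, idf) for tokens in train_tokens],
    [label for _, label in TRAIN],
    vocab=idf.keys(),
)


def classify(text):
    """Preprocess, weight, and classify a single article string."""
    return model.predict(tfidf(tokenize(text), idf))
```

Calling `classify("gol sepak bola timnas")` on this toy model assigns the sports label, since every term in the query occurs only in the sports examples. The SVD variant and the other two classifiers compared in the abstract are not covered by this sketch.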
