Conference: International Conference on Computational Intelligence for Smart Power System and Sustainable Energy

Topic categorization of Tamil News Articles using PreTrained Word2Vec Embeddings with Convolutional Neural Network


Abstract

Almost all problems in NLP are now addressed with techniques ranging from classical machine learning to deep learning. Language localization, however, remains poorly understood: for languages other than English, NLP problems such as entity extraction, OCR, and classification and prediction in sequence modelling are still not well studied. The number of people using local languages (Tamil, Telugu, Hindi, etc.) on social media is increasing, so it is important to automate the classification of such content. Here, the aim is to classify Tamil news articles into their related topics (Sports, Cinema, Politics). Existing work approached the task with traditional machine learning methods using TFIDF of words as features. In this work, we compare the existing TFIDF feature learning with pre-trained embeddings fed to a Convolutional Neural Network (CNN). We found that the CNN with pre-trained embeddings gave a better F1 score than TFIDF features learned with Support Vector Machine (SVM) and Naive Bayes (NB) algorithms.
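As a rough illustration of two ingredients the abstract names, the TFIDF features used by the baseline and the F1 score used for comparison, here is a minimal standard-library sketch. It is not the authors' implementation: the function names `tfidf` and `macro_f1`, the unsmoothed `log(N / df)` weighting, and the choice of macro averaging are all assumptions made for this example, since the abstract does not specify them.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: relative term frequency times log(N / df).
    Real toolkits usually add smoothing and normalization.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (averaging mode is assumed)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data; the tokens and labels are placeholders.
docs = [["match", "team"], ["match", "film"]]
vectors = tfidf(docs)
y_true = ["sports", "cinema", "politics", "sports"]
y_pred = ["sports", "cinema", "sports", "sports"]
score = macro_f1(y_true, y_pred)
```

A term that occurs in every document (here `match`) gets weight zero, which is why TFIDF favors topic-specific words. In the setup the abstract describes, such vectors would feed an SVM or NB classifier, while the CNN would instead receive pre-trained Word2Vec embeddings.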
