Conference: International Conference on Information Technology Research

Word Level Language Identification of Code Mixing Text in Social Media using NLP

Abstract

Understanding social media content has been a primary research topic since the dawn of social networking. A particular challenge is the contextual understanding of noisy text, which is characterized by a high proportion of spelling mistakes, creative spelling, phonetic typing, wordplay, abbreviations, and meta tags. Processing such data therefore demands a more complex system than traditional natural language processors. In addition, people readily mix two or more languages to express their thoughts on social media. Automatic language identification at the word level thus becomes a necessary part of analyzing noisy social media content, and it would help with the automated analysis of content generated on social media. This study uses Tamil-English code-mixed data from popular social media posts and comments and assigns word-level language tags using Natural Language Processing (NLP) and modern Machine Learning (ML) technologies. The methodology is a novel approach implemented as a machine learning classifier based on features such as Tamil Unicode characters in Roman script, dictionaries, double consonants, and term frequency. Several machine learning classifiers, namely Naive Bayes, Logistic Regression, Support Vector Machines (SVM), Decision Trees, and Random Forest, were used in training and testing. Among them, the SVM classifier obtained the highest accuracy, 89.46%.
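
The feature set named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy word lists, the function names, and the exact definition of each feature (for example, treating the Unicode feature as "the token contains characters from the Tamil Unicode block") are assumptions made for illustration only.

```python
import re
import string
from collections import Counter

# Hypothetical toy dictionaries; the paper's actual word lists are not given.
ENGLISH_WORDS = {"movie", "super", "the", "is", "good"}
TAMIL_ROMANIZED_WORDS = {"padam", "romba", "nalla", "irukku"}

# Assumption: the Unicode feature checks for characters in the Tamil block.
TAMIL_BLOCK = re.compile(r"[\u0B80-\u0BFF]")
# Assumption: "double consonant" means the same Latin consonant twice in a row.
DOUBLE_CONSONANT = re.compile(r"([b-df-hj-np-tv-z])\1")


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in tokens if w]


def word_features(word, term_counts):
    """Build one feature dictionary for a single token."""
    return {
        "has_tamil_script": bool(TAMIL_BLOCK.search(word)),
        "in_english_dict": word in ENGLISH_WORDS,
        "in_tamil_dict": word in TAMIL_ROMANIZED_WORDS,
        "has_double_consonant": bool(DOUBLE_CONSONANT.search(word)),
        "term_frequency": term_counts[word],
    }


# Made-up example posts, not data from the study.
posts = ["Padam romba nalla irukku!", "the movie is super", "padam super"]
term_counts = Counter(tok for post in posts for tok in tokenize(post))

for token in tokenize(posts[0]):
    print(token, word_features(token, term_counts))
```

Feature dictionaries of this kind could then be vectorized and passed to any of the classifiers the abstract lists; the abstract does not specify how the features were encoded or how the classifiers were configured.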
