Conference: IEEE International Conference on Information Reuse and Integration

Classification for Authorship of Tweets by Comparing Logistic Regression and Naive Bayes Classifiers

Abstract

At a time when all it takes to open a Twitter account is a mobile phone, authenticating information encountered on social media becomes very complex, especially when we lack measures to verify digital identities in the first place. Because the platform supports anonymity, fake news generated by dubious sources has been observed to travel much faster and farther than real news. Hence, we need valid measures to identify the authors of misinformation and avert these consequences. Researchers have proposed different authorship attribution techniques to approach this kind of problem. However, because tweets are limited to 280 characters, finding a suitable authorship attribution technique is a challenge. This research aims to classify the authors of tweets by comparing two machine learning methods, logistic regression and naive Bayes. The process consists of fetching tweets, pre-processing, feature extraction, and developing a machine learning model for classification. This paper illustrates the authorship classification process using machine learning techniques. In total, 46,895 tweets were used as training and testing data, and features specific to Twitter were extracted. The pre-processing phase included removal of short texts, removal of stop-words and punctuation, tokenization, and stemming. The pre-processed data were then transformed into a set of feature vectors in Python. Logistic regression and naive Bayes algorithms were applied to the feature vectors to train and test the classifiers. The logistic regression classifier achieved the higher accuracy, 91.1%, compared with 89.8% for the naive Bayes classifier.
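The pipeline described in the abstract (pre-processing, feature extraction, then training and comparing two classifiers) can be illustrated with a minimal sketch. This is not the authors' implementation: the stop-word list, the `crude_stem` suffix stripper, the `min_tokens` threshold, the bag-of-words features, the training settings, and the toy tweets by the hypothetical authors "alice" and "bob" are all assumptions made for illustration. It uses only the Python standard library and handles just two authors.

```python
import math
import re
from collections import Counter, defaultdict

# Illustrative stop-word list (an assumption; the abstract does not give one).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on",
              "for", "it", "this", "that", "my", "what", "all", "at"}

def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

def preprocess(tweet):
    """Lowercase, drop punctuation, tokenize, remove stop-words, stem."""
    tokens = re.findall(r"[a-z0-9]+", tweet.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def build_dataset(labeled_tweets, min_tokens=3):
    """Pre-process each tweet and discard texts that end up too short."""
    docs, labels = [], []
    for author, tweet in labeled_tweets:
        tokens = preprocess(tweet)
        if len(tokens) >= min_tokens:
            docs.append(tokens)
            labels.append(author)
    return docs, labels

def train_naive_bayes(docs, labels):
    """Collect the counts needed for multinomial naive Bayes."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {t for counts in word_counts.values() for t in counts}
    return class_counts, word_counts, vocab

def predict_naive_bayes(model, tokens):
    """Return the author with the highest add-one-smoothed log-posterior."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n_docs / total_docs)
        for t in tokens:
            if t in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][t] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

def train_logistic_regression(docs, labels, positive, epochs=50, lr=0.1):
    """Fit binary logistic regression on word counts by stochastic gradient ascent."""
    weights, bias = defaultdict(float), 0.0
    for _ in range(epochs):
        for tokens, label in zip(docs, labels):
            feats = Counter(tokens)
            z = bias + sum(weights[t] * v for t, v in feats.items())
            err = (1.0 if label == positive else 0.0) - 1.0 / (1.0 + math.exp(-z))
            bias += lr * err
            for t, v in feats.items():
                weights[t] += lr * err * v
    return dict(weights), bias

def predict_logistic_regression(model, tokens, positive, negative):
    """Predict the positive author when the linear score is non-negative."""
    weights, bias = model
    z = bias + sum(weights.get(t, 0.0) for t in tokens)
    return positive if z >= 0 else negative

def accuracy(predict, examples):
    """Fraction of examples whose predicted author matches the true one."""
    hits = sum(predict(preprocess(text)) == author for author, text in examples)
    return hits / len(examples)

# Invented toy tweets by two hypothetical authors (not the paper's data).
TRAIN = [
    ("alice", "Training neural networks all night, the gradients finally converged!"),
    ("alice", "Loving this new machine learning library for training models"),
    ("alice", "My models are overfitting again, more data needed"),
    ("bob", "Great football match tonight, what a goal in the final minute!"),
    ("bob", "The team played amazing football, best match of the season"),
    ("bob", "Scored two goals at practice today"),
    ("bob", "ok lol"),  # removed by build_dataset as a short text
]
TEST = [("alice", "training models with gradients"),
        ("bob", "football goal tonight")]

docs, labels = build_dataset(TRAIN)
nb_model = train_naive_bayes(docs, labels)
lr_model = train_logistic_regression(docs, labels, positive="alice")

nb_acc = accuracy(lambda toks: predict_naive_bayes(nb_model, toks), TEST)
lr_acc = accuracy(
    lambda toks: predict_logistic_regression(lr_model, toks, "alice", "bob"), TEST)
print(f"naive Bayes accuracy: {nb_acc:.2f}, logistic regression accuracy: {lr_acc:.2f}")
```

On a toy set this small both classifiers separate the two authors easily, so the sketch shows only the mechanics of the comparison; it says nothing about the 91.1% and 89.8% accuracies reported for the 46,895-tweet data set.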