Journal: JMIR Public Health and Surveillance

Predicting Age Groups of Reddit Users Based on Posting Behavior and Metadata: Classification Model Development and Validation

Abstract

Background: Social media are important for monitoring perceptions of public health issues and for educating target audiences about health; however, limited information about the demographics of social media users makes it challenging to identify conversations among target audiences and limits how well social media can be used for public health surveillance and education outreach efforts. Certain social media platforms provide demographic information on the followers of a user account when it has been supplied, but this information is not always disclosed, and researchers have developed machine learning algorithms to predict social media users’ demographic characteristics, mainly for Twitter. To date, there has been limited research on predicting the demographic characteristics of Reddit users.

Objective: We aimed to develop a machine learning algorithm that predicts the age segment of Reddit users, as either adolescents or adults, based on publicly available data.

Methods: This study was conducted between January and September 2020 using publicly available Reddit posts as input data. We manually labeled Reddit users’ age by identifying and reviewing public posts in which Reddit users self-reported their age. We then collected sample posts, comments, and metadata for the labeled user accounts and created variables to capture linguistic patterns, posting behavior, and account details that would distinguish the adolescent age group (aged 13 to 20 years) from the adult age group (aged 21 to 54 years). We split the data into a training set (n=1660) and a test set (n=415) and used 5-fold cross-validation on the training set to select hyperparameters and features. We ran multiple classification algorithms and tested the performance of the models (precision, recall, F1 score) in predicting the age segments of the users in the labeled data. To evaluate associations between each feature and the outcome, we calculated means and confidence intervals for each transformed model feature and compared the two age groups with 2-sample t tests.

Results: The gradient boosted trees classifier performed the best, with an F1 score of 0.78. The test set precision and recall scores were 0.79 and 0.89, respectively, for the adolescent group (n=254) and 0.78 and 0.63, respectively, for the adult group (n=161). The most important feature in the model was the number of sentences per comment (permutation score: mean 0.100, SD 0.004). Compared with members of the adult group, members of the adolescent age group tended to have created their accounts more recently, to have higher proportions of submissions and comments in the r/teenagers subreddit, and to post more in subreddits with higher subscriber counts.

Conclusions: We created a Reddit age prediction algorithm with competitive accuracy using publicly available data, suggesting that machine learning methods can help public health agencies identify age-related target audiences on Reddit. Our results also suggest that there are characteristics of Reddit users’ posting behavior, linguistic patterns, and account features that distinguish adolescents from adults.
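
As a worked illustration of the reported metrics, the per-class precision and recall quoted in the Results can be combined into F1 scores with a few lines of arithmetic. The sketch below is not the authors' code: the helper name `f1_score` and the dictionary layout are assumptions made for illustration, and the inputs are the rounded values from the abstract, so the derived figures are approximate. The abstract does not say how the overall F1 score was averaged across the two classes, so both a macro average and a support-weighted average are shown.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (hypothetical helper)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Test-set figures quoted in the abstract: (precision, recall, group size).
reported = {
    "adolescent": (0.79, 0.89, 254),
    "adult": (0.78, 0.63, 161),
}

# F1 for each age group, computed from the rounded precision and recall.
per_class_f1 = {name: f1_score(p, r) for name, (p, r, _) in reported.items()}

# Total test-set size; the two group sizes should add up to n=415.
total = sum(n for _, _, n in reported.values())

# Macro average: every class counts equally.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Support-weighted average: every class counts in proportion to its size.
weighted_f1 = sum(
    per_class_f1[name] * n for name, (_, _, n) in reported.items()
) / total

for name, value in per_class_f1.items():
    print(f"{name}: F1 = {value:.3f}")
print(f"macro F1 = {macro_f1:.3f}, support-weighted F1 = {weighted_f1:.3f}")
```

With these rounded inputs, the adolescent group comes out at an F1 of roughly 0.84 and the adult group at roughly 0.70. The support-weighted average is about 0.78, which is consistent with the overall score in the Results, although that agreement alone does not confirm which averaging the authors used.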
