Conference: International Conference on eDemocracy eGovernment

Using Reddit Data for Multi-Label Text Classification of Twitter Users Interests

Abstract

Automatically inferring users' interest groups is a challenging task in social network research, with applications in marketing and recommendation systems. Manually labeling documents is difficult and expensive, yet it is essential for training an automatic text classifier. Several existing approaches treat the problem as a multi-label prediction task. In this work, a methodology is proposed to categorize data automatically by combining Reddit and Twitter data. First, a dataset of 42,100 publications from the popular forum site Reddit is collected to train a model on labeled data. Then, a dataset of tweets from 1573 profiles, averaging 100 tweets per user, is collected, and the trained model is used to predict each user's topics of interest. Finally, the data were categorized automatically with an average precision of 75.62%.
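The pipeline described in the abstract (train on labeled Reddit text, then predict several interest labels per Twitter user and score the predictions) can be illustrated with a small sketch. The abstract does not name the classifier, the features, or the exact definition of "average precision", so everything below is an assumption made for illustration: a smoothed bag-of-words scorer stands in for the trained model, the toy posts and labels are invented, the function names (`train`, `predict_interests`, `mean_precision`) are hypothetical, and precision is read as the per-user fraction of correct predicted labels, averaged over users.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train(posts):
    """Build per-label word counts from (text, label) pairs.

    Stand-in for the paper's model, which the abstract does not specify.
    """
    counts, vocab = {}, set()
    for text, label in posts:
        tokens = tokenize(text)
        counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return counts, vocab


def score_labels(model, text):
    """Length-normalized log-likelihood per label with add-one smoothing."""
    counts, vocab = model
    tokens = [t for t in tokenize(text) if t in vocab]
    scores = {}
    for label, word_counts in counts.items():
        total = sum(word_counts.values()) + len(vocab)
        loglik = sum(math.log((word_counts[t] + 1) / total) for t in tokens)
        scores[label] = loglik / max(len(tokens), 1)
    return scores


def predict_interests(model, tweets, top_k=2):
    """Pool one user's tweets and return the top_k highest-scoring labels."""
    scores = score_labels(model, " ".join(tweets))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def mean_precision(predicted, actual):
    """Average over users of (correct predicted labels / predicted labels)."""
    per_user = [len(set(p) & set(a)) / len(p)
                for p, a in zip(predicted, actual) if p]
    return sum(per_user) / len(per_user)


# Invented toy data: each post carries one topic label (e.g. its forum).
reddit_posts = [
    ("new gpu benchmark for gaming laptops", "technology"),
    ("python library release improves compiler speed", "technology"),
    ("the striker scored a late goal in the final", "sports"),
    ("marathon training plan for the new season", "sports"),
    ("easy pasta recipe with garlic and tomato", "food"),
    ("baking bread at home with sourdough starter", "food"),
]
model = train(reddit_posts)

user_tweets = ["watching the final tonight, what a goal",
               "trying a new pasta recipe"]
interests = predict_interests(model, user_tweets, top_k=2)
print(interests)  # ['sports', 'food']
```

Pooling all of a user's tweets before scoring is one simple way to move from single-document labels to a multi-label user profile; scoring each tweet separately and aggregating the votes would be an equally plausible alternative.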
