IEEE International Conference on Big Data

Baselines for demographic inference on a new gold standard twitter corpus

Abstract

A variety of studies have shown that machine learning methods such as convolutional neural nets and random forests can accurately infer characteristics of people online, including their gender, age, race, or political orientation. However, these studies rely on labels generated from the data themselves, typically through human coding of subjects, and presume that the subjects are authentic humans. This creates systematic selection biases, because subjects can only be labeled from the features that human coders are able to draw inferences from. In this preliminary study, we connect Twitter data to an exogenous data source, public voter data, to create a new gold standard data set for inferring demographic information about online participants. We run a standard battery of machine learning algorithms on bag-of-words representations of individuals' Twitter posts to generate new baselines for how well these characteristics can be predicted. Our baselines are substantially lower than those in most reported studies, suggesting that sampling bias has led to an overestimation of how well machine learning algorithms perform on this task.
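
The general pipeline the abstract describes (bag-of-words features per user, several classifiers, scores compared against a baseline) might be sketched as follows. This is an illustrative sketch only, not the authors' code or data: the toy posts, the placeholder labels `a` and `b`, and the two models (a majority-class baseline and a multinomial naive Bayes classifier) are all assumptions chosen for brevity. The abstract does not say which algorithms made up the battery.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens."""
    return Counter(text.lower().split())


class MajorityBaseline:
    """Always predicts the most frequent training label."""

    def fit(self, docs, labels):
        self.label_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, doc):
        return self.label_


class MultinomialNB:
    """Multinomial naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.class_counts_ = Counter(labels)
        self.word_counts_ = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts_[label].update(doc)
        self.vocab_ = {w for c in self.word_counts_.values() for w in c}
        return self

    def predict(self, doc):
        n_docs = sum(self.class_counts_.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts_.items():
            total = sum(self.word_counts_[label].values())
            denom = total + self.alpha * len(self.vocab_)
            score = math.log(count / n_docs)  # log prior
            for word, freq in doc.items():
                if word in self.vocab_:  # ignore words unseen in training
                    numer = self.word_counts_[label][word] + self.alpha
                    score += freq * math.log(numer / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, docs, labels):
    hits = sum(model.predict(d) == y for d, y in zip(docs, labels))
    return hits / len(labels)


# Hypothetical toy data: one string of concatenated posts per user,
# with a placeholder label. Real labels would come from an external source.
train = [
    ("coffee morning run park", "a"),
    ("run park trail morning", "a"),
    ("coffee run trail", "a"),
    ("game stream night pizza", "b"),
    ("pizza night movie", "b"),
]
test = [("morning trail coffee", "a"), ("movie night stream", "b")]

X_train = [bag_of_words(t) for t, _ in train]
y_train = [y for _, y in train]
X_test = [bag_of_words(t) for t, _ in test]
y_test = [y for _, y in test]

# The "battery": each model is fit and scored on the same held-out users.
battery = {"majority": MajorityBaseline(), "naive_bayes": MultinomialNB()}
results = {
    name: accuracy(model.fit(X_train, y_train), X_test, y_test)
    for name, model in battery.items()
}
print(results)
```

Scoring every model on the same held-out users, next to a trivial majority-class baseline, shows how much of a reported accuracy comes from the text features rather than from class imbalance.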
