Baselines for demographic inference on a new gold standard twitter corpus


Abstract

A variety of studies have shown that machine learning methods like convolutional neural nets and random forests can be used to accurately infer characteristics of people online, such as their gender, age, race, or political orientation. However, these studies are based on labels generated using the data themselves, typically human coding of subjects, and presume subjects are authentic humans. This creates systematic selection biases owing to which features humans can draw inferences from. In this preliminary study, we connect Twitter data to an exogenous data source, public voter data, to create a new gold standard data set for inferring demographic information about online participants. We run a standard battery of machine learning algorithms on bag-of-words representations of individuals' Twitter posts to generate new baselines for how well these characteristics can be predicted. Our baselines are substantially lower than those in most reported studies, suggesting sampling bias has led to an overestimation of how well machine learning algorithms perform on this task.
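
As a rough illustration of what a bag-of-words baseline looks like, the sketch below trains a multinomial Naive Bayes classifier on word counts using only the Python standard library. It is not the authors' pipeline: the abstract does not say which algorithms, features, or labels were used, so the classifier choice, the toy posts, the labels "A" and "B", and all function names here are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: (all posts of one user joined together, label).
# Invented for illustration; not drawn from the paper's corpus.
TRAIN = [
    ("love the game tonight go team", "A"),
    ("great game great team win", "A"),
    ("new recipe for dinner tonight", "B"),
    ("baking bread and a new dinner recipe", "B"),
]
TEST = [
    ("what a game by the team", "A"),
    ("dinner recipe ideas", "B"),
]


def bag_of_words(text):
    """Map a user's text to word counts; word order is discarded."""
    return Counter(text.lower().split())


def train_naive_bayes(examples):
    """Collect per-label user counts, per-label word counts, and the vocabulary."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        bow = bag_of_words(text)
        label_counts[label] += 1
        word_counts[label].update(bow)
        vocab.update(bow)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior under Naive Bayes."""
    label_counts, word_counts, vocab = model
    total_users = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_users)
        # Laplace (add-one) smoothing over the training vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for word, n in bag_of_words(text).items():
            if word in vocab:  # words never seen in training are ignored
                score += n * math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, examples):
    """Fraction of held-out users whose label is predicted correctly."""
    hits = sum(predict(model, text) == label for text, label in examples)
    return hits / len(examples)


model = train_naive_bayes(TRAIN)
print(predict(model, "what a game by the team"))  # A
print(predict(model, "dinner recipe ideas"))      # B
print(accuracy(model, TEST))                      # 1.0
```

The perfect score on this toy set says nothing about real performance. The abstract's point is that accuracy depends heavily on where the labels come from, so a baseline of this kind is only as informative as the held-out labels it is scored against.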


Bibliographic details

Source: IEEE International Conference on Big Data, 2017, pp. 4822-4823 (2 pages)
Authors: Jason Radford; Luke Horgan; David Lazer
Original format: PDF
Keywords: Twitter; Big Data; Standards; Prediction algorithms; Gold; Batteries


