Conference on Empirical Methods in Natural Language Processing

Confounds and Consequences in Geotagged Twitter Data

Abstract

Twitter is often used in quantitative studies that identify geographically-preferred topics, writing styles, and entities. These studies rely on either GPS coordinates attached to individual messages, or on the user-supplied location field in each profile. In this paper, we compare these data acquisition techniques and quantify the biases that they introduce; we also measure their effects on linguistic analysis and text-based geolocation. GPS-tagging and self-reported locations yield measurably different corpora, and these linguistic differences are partially attributable to differences in dataset composition by age and gender. Using a latent variable model to induce age and gender, we show how these demographic variables interact with geography to affect language use. We also show that the accuracy of text-based geolocation varies with population demographics, giving the best results for men above the age of 40.