Leveraging Stratification in Twitter Sampling

Abstract

With Tweet volumes reaching 500 million a day, sampling is inevitable for any application that uses Twitter data. Recognizing this, data providers such as Twitter, Gnip and Boardreader license sampled data streams priced according to sample size. Big Data applications working with sampled data need a sample large enough to be representative of the full dataset. Previous work on representativeness has sought to ensure that the global occurrence rates of key terms can be reliably estimated from the sample. Existing techniques estimate the sample size required to meet probabilistic bounds on occurrence rates under uniform random sampling. In this paper, we consider the problem of further improving sample size estimates by leveraging stratification in Twitter data. We analyze our estimates through an extensive study using simulations and real-world data, establishing the superiority of our method over uniform random sampling. Our work provides the technical know-how for data providers to expand their portfolio to include stratified sampled datasets, while applications benefit by being able to monitor more topics/events at the same data and computing cost.
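The abstract does not spell out the paper's estimator, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the textbook normal approximation for a proportion, proportional allocation across strata, and no finite-population correction. The function names, the two strata, and the occurrence rates are all hypothetical and chosen purely for illustration.

```python
import math


def srs_sample_size(p, margin, z=1.96):
    """Sample size needed to estimate an occurrence rate p to within
    +/- margin under uniform random sampling (normal approximation)."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


def stratified_sample_size(weights, rates, margin, z=1.96):
    """Total sample size under proportional allocation across strata.

    weights: share of the population in each stratum (must sum to 1)
    rates:   occurrence rate of the term within each stratum
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("stratum weights must sum to 1")
    # Only the within-stratum variance contributes under stratification.
    within = sum(w * p * (1 - p) for w, p in zip(weights, rates))
    return math.ceil(z ** 2 * within / margin ** 2)


# Hypothetical strata: a small group where the term is common and a
# large group where it is rare.
weights = [0.2, 0.8]
rates = [0.30, 0.01]
overall_rate = sum(w * p for w, p in zip(weights, rates))

n_srs = srs_sample_size(overall_rate, margin=0.005)
n_strat = stratified_sample_size(weights, rates, margin=0.005)
print(f"uniform: {n_srs} tweets, stratified: {n_strat} tweets")
```

Under these assumptions the stratified estimate never needs a larger sample than the uniform one, and the gap widens as the per-stratum rates differ more. When every stratum has the same rate, the two formulas coincide.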

Bibliographic details

Authors: Joshi Vikas; Padmanabhan Deepak; Subramaniam LV
Year: 2016
Original format: PDF
Language: English

Similar literature

1. Inferring Twitters' Socio-demographics to Correct Sampling Bias of Social Media Data for Augmenting Travel Behavior Analysis [J]. Yu Cui, Qing He. Journal of Big Data Analytics in Transportation. 2021, Issue 2
2. Walking Through Twitter: Sampling a Language-Based Follow Network of Influential Twitter Accounts [J]. Felix Victor Münch, Ben Thies, Cornelius Puschmann. Social Media + Society. 2021, Issue 1
3. Representing the Twittersphere: Archiving a representative sample of Twitter data under resource constraints [J]. Hino Airo, Fahey Robert A. International Journal of Information Management. 2019, Issue Octa
4. Leveraging Stratification in Twitter Sampling [C]. Vikas Joshi, Deepak P., L. V. Subramaniam. European Conference on Artificial Intelligence. 2016
5. Leveraging Twitter Data to Support Transit Planning and Operations [D]. Kabbani, Omar. 2020
6. Leveraging machine learning-based approaches to assess human papillomavirus vaccination sentiment trends with Twitter data [O]. Jingcheng Du, Jun Xu, Hsing-Yi Song. 2017
7. Is the Sample Good Enough? Comparing Data from Twitter's Streaming API with Twitter's Firehose [O]. Morstatter, Fred; Pfeffer, Jürgen; Liu, Huan. 2013