Leveraging Stratification in Twitter Sampling

Abstract

With Tweet volumes reaching 500 million a day, sampling is inevitable for any application using Twitter data. Recognizing this, data providers such as Twitter, Gnip and Boardreader license sampled data streams priced according to sample size. Big Data applications that work with sampled data want a sample large enough to be representative of the universal dataset. Previous work on representativeness has focused on ensuring that the global occurrence rates of key terms can be reliably estimated from the sample. Present technology allows sample sizes to be estimated from probabilistic bounds on occurrence rates for the case of uniform random sampling. In this paper, we consider the problem of further improving sample size estimates by leveraging stratification in Twitter data. We analyze our estimates through an extensive study using simulations and real-world data, establishing the superiority of our method over uniform random sampling. Our work provides the technical know-how for data providers to expand their portfolio to include stratified sampled datasets, while applications benefit from being able to monitor more topics and events at the same data and computing cost.
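The abstract does not give the paper's bounds or say how it forms strata, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a two-sided Hoeffding bound for the uniform random sampling case and textbook proportional allocation for the stratified case, and it ignores the finite population correction. The function names, the strata weights, and the per-stratum occurrence rates are all hypothetical.

```python
import math


def hoeffding_sample_size(epsilon, delta):
    """Sample size so that |p_hat - p| <= epsilon with probability >= 1 - delta
    under uniform random sampling (two-sided Hoeffding bound; an assumption,
    not necessarily the bound used in the paper)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def uniform_variance(weights, rates, n):
    """Variance of the estimated occurrence rate under uniform random sampling."""
    p = sum(w * r for w, r in zip(weights, rates))  # global occurrence rate
    return p * (1.0 - p) / n


def stratified_variance(weights, rates, n):
    """Variance under stratified sampling with proportional allocation,
    i.e. stratum h receives n * weights[h] of the n samples."""
    return sum(w * r * (1.0 - r) for w, r in zip(weights, rates)) / n


if __name__ == "__main__":
    # Hypothetical strata: share of all Tweets, and rate of a key term in each.
    weights = [0.7, 0.2, 0.1]
    rates = [0.001, 0.01, 0.2]
    n = hoeffding_sample_size(epsilon=0.01, delta=0.05)
    print("uniform sample size:", n)
    print("uniform variance:   ", uniform_variance(weights, rates, n))
    print("stratified variance:", stratified_variance(weights, rates, n))
```

With proportional allocation the stratified variance never exceeds the uniform one, and the gap grows as the key term's rate differs more between strata. That is the intuition behind reaching the same accuracy with a smaller sample.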

Bibliographic record

Source: European Conference on Artificial Intelligence | 2016 | pp. 913-1833 | 9 pages
Authors: Vikas Joshi; Deepak P.; L. V. Subramaniam
Original format: PDF
Chinese Library Classification: TP18-53
