Conference: European Conference on Artificial Intelligence

Leveraging Stratification in Twitter Sampling



Abstract

With Tweet volumes reaching 500 million a day, sampling is inevitable for any application that uses Twitter data. Recognizing this, data providers such as Twitter, Gnip and Boardreader license sampled data streams priced according to sample size. Big Data applications that work with sampled data need a sample large enough to be representative of the universal dataset. Previous work on representativeness has focused on ensuring that the global occurrence rates of key terms can be reliably estimated from the sample. Present technology allows the sample size to be estimated from probabilistic bounds on occurrence rates for the case of uniform random sampling. In this paper, we consider the problem of further improving sample size estimates by leveraging stratification in Twitter data. We analyze our estimates through an extensive study using simulations and real-world data, establishing the superiority of our method over uniform random sampling. Our work provides the technical know-how for data providers to expand their portfolio to include stratified sampled datasets, while applications benefit from being able to monitor more topics/events at the same data and computing cost.
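
To make the comparison between uniform and stratified sampling concrete, the following is a minimal sketch and not the method of the paper. It assumes a Hoeffding-style bound for the uniform case and proportional allocation across strata for the stratified case; the function names, stratum weights and per-stratum occurrence rates are all hypothetical and chosen only for illustration.

```python
import math


def uniform_sample_size(epsilon: float, delta: float) -> int:
    """Sample size from Hoeffding's inequality (an assumed bound, not the paper's).

    Guarantees |estimated rate - true rate| <= epsilon with probability
    at least 1 - delta under uniform random sampling with replacement.
    """
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def srs_variance(weights, rates, n: int) -> float:
    """Variance of the occurrence-rate estimate under uniform random sampling."""
    p = sum(w * r for w, r in zip(weights, rates))  # global occurrence rate
    return p * (1.0 - p) / n


def stratified_variance(weights, rates, n: int) -> float:
    """Variance under stratified sampling with proportional allocation.

    Each stratum h receives n_h = weights[h] * n samples, so the variance of
    the weighted estimate is sum_h weights[h] * rates[h] * (1 - rates[h]) / n.
    Rounding of n_h and finite-population corrections are ignored.
    """
    return sum(w * r * (1.0 - r) for w, r in zip(weights, rates)) / n


# Hypothetical strata: 20% of tweets mention a term at rate 0.10,
# the remaining 80% mention it at rate 0.01.
weights = [0.2, 0.8]
rates = [0.10, 0.01]
n = uniform_sample_size(epsilon=0.01, delta=0.05)

ratio = stratified_variance(weights, rates, n) / srs_variance(weights, rates, n)
print(f"uniform sample size: {n}")
print(f"stratified / uniform variance ratio: {ratio:.3f}")
```

Under these assumptions the stratified variance never exceeds the uniform one, and the gap grows as the per-stratum occurrence rates move further apart. A ratio below 1 means a proportionally smaller stratified sample reaches the same variance.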
