Conference: IEEE International Conference on Big Data

EPIC30M: An Epidemics Corpus of Over 30 Million Relevant Tweets

Abstract

Since the start of COVID-19, several relevant corpora from various sources have been released to support research in this area. While these corpora are valuable for analyzing this specific pandemic, researchers would benefit from additional benchmark corpora that cover other epidemics, both for better generalizability and to facilitate cross-epidemic pattern recognition and trend analysis tasks. During our research, we found few disease-related corpora in the literature that are sizable and rich enough to support such cross-epidemic analysis tasks. To address this issue, we present EPIC30M, a large-scale epidemic corpus that contains more than 30 million micro-blog posts, i.e., tweets crawled from Twitter, from 2006 to 2020. EPIC30M contains a subset of 26.2 million tweets related to three general diseases, namely Ebola, Cholera and Swine Flu, and another subset of 4.7 million tweets on six global epidemic outbreaks: the 2009 H1N1 Swine Flu, 2010 Haiti Cholera, 2012 Middle-East Respiratory Syndrome (MERS), 2013 West African Ebola, 2016 Yemen Cholera and 2018 Kivu Ebola. Furthermore, we explore and discuss the properties of this corpus, with statistics on key terms and hashtags and a trend analysis for each subset. Finally, we discuss the potential value and impact of EPIC30M through multiple use cases of cross-epidemic research topics that have attracted growing interest in recent years. These use cases span multiple research areas, such as epidemiological modeling, pattern recognition, natural language understanding and economic modeling. The corpus is publicly available at https://www.github.com/junhua/epic.
