Mega-COV: A Billion-Scale Dataset of 100+ Languages for COVID-19

Abstract

We describe Mega-COV, a billion-scale dataset from Twitter for studying COVID-19. The dataset is diverse (covers 268 countries), longitudinal (goes back as far as 2007), multilingual (comes in 100+ languages), and has a significant number of location-tagged tweets (~169M tweets). We release tweet IDs from the dataset. We also develop two powerful models, one for identifying whether or not a tweet is related to the pandemic (best F_1=97%) and another for detecting misinformation about COVID-19 (best F_1=92%). A human annotation study reveals the utility of our models on a subset of Mega-COV. Our data and models can be useful for studying a wide host of phenomena related to the pandemic. Mega-COV and our models are publicly available.

Bibliographic record

Source
Conference of the European Chapter of the Association for Computational Linguistics | 2021 | pp. 3402-3420 | 19 pages
Authors
Muhammad Abdul-Mageed; AbdelRahim Elmadany; El Moatez Billah Nagoudi; Dinesh Pabbi; Kunal Verma; Rannie Lin
Original format
PDF
Date indexed
2022-08-26 13:58:10
