Conference: IEEE International Conference on Communications Workshops

Evaluation of Hybrid Unsupervised and Supervised Machine Learning Approach to Detect Self-Reporting of COVID-19 Symptoms on Twitter

Abstract

With over 127 million cases globally, the COVID-19 pandemic marks a sentinel event in global health. However, accurate estimates of the true case count have been elusive because of limited testing and diagnostic capacity, asymptomatic cases, and individuals who do not get tested or seek care. Concomitantly, new digital surveillance tools to detect, characterize, and report COVID-19 cases are emerging, including tools that use structured and unstructured data from users self-reporting COVID-19-related experiences on the Internet and social media platforms. In this study, we develop and evaluate a hybrid unsupervised and supervised machine learning approach to detect self-reported COVID-19-related symptoms on Twitter during the early stages of the pandemic. Tweets were collected from the public API stream from March 3 to March 31, 2020, and filtered for COVID-19-related terms. We used the biterm topic model to cluster the first 18 days of tweets into theme-associated groups, which were then extracted and manually annotated to identify users self-reporting suspected COVID-19 symptoms or status. Using this manually annotated data as a training set, we trained an XLNet deep learning model to classify symptom-related tweets from a larger corpus and evaluated its performance. From 4,492,954 tweets collected, the unsupervised learning process yielded 3,465 (<1%) symptom tweets used to form our ground-truth COVID-19 symptoms dataset (n = 11,550). The XLNet text classifier achieved the highest accuracy (0.91) and F1 (0.62) compared to the baseline models evaluated for classification. After re-training with an adjusted loss function, we boosted the classifier's precision to 0.81 while maintaining a high F1 (0.66), which identified an additional 2,622 symptom-related tweets when the classifier was applied to a further 11 days of collected tweets. Our study used a hybrid machine learning approach to enable high-precision identification of Twitter user-generated COVID-19 symptom discussions. The model is a digital epidemiology tool that can identify social media users who self-report symptoms during the early periods of an outbreak.
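
The abstract names three technical ingredients without giving details: biterms as the input unit of the biterm topic model, precision and F1 as evaluation metrics, and an "adjusted loss function" used to raise precision. The Python sketch below illustrates each in its simplest form. It is not the authors' implementation. The function names are hypothetical, and the class-weighted binary cross-entropy is only one common way to adjust a loss; the abstract does not say which adjustment was actually used.

```python
import math
from itertools import combinations


def extract_biterms(tokens):
    """Return the unordered word pairs (biterms) co-occurring in one short text.

    The biterm topic model is fitted on such pairs rather than on
    per-document word counts, which helps with very short texts like tweets.
    """
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for the positive (symptom) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def weighted_bce(y_true, p_pred, pos_weight=1.0, neg_weight=1.0):
    """Class-weighted binary cross-entropy, averaged over examples.

    ASSUMPTION: illustrative stand-in for the paper's unspecified
    "adjusted loss function". Raising neg_weight makes confident false
    positives more costly, which tends to favour precision over recall.
    """
    eps = 1e-12  # keep log() away from 0
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total -= pos_weight * y * math.log(p) + neg_weight * (1 - y) * math.log(1 - p)
    return total / len(y_true)


if __name__ == "__main__":
    # Toy inputs, not data from the study.
    print(extract_biterms(["fever", "cough", "covid"]))
    print(precision_recall_f1(tp=8, fp=2, fn=4))
    print(weighted_bce([0, 1], [0.9, 0.9], neg_weight=2.0))
```

The metric function also shows why accuracy and F1 can diverge as they do in the reported results: accuracy counts true negatives, which dominate when symptom tweets are the minority class, while F1 depends only on true positives, false positives, and false negatives.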