Conference: International Conference on Artificial Intelligence in Medicine

Using Event-Based Web-Scraping Methods and Bidirectional Transformers to Characterize COVID-19 Outbreaks in Food Production and Retail Settings

Abstract

Current surveillance methods may not capture the full extent of COVID-19 spread in high-risk settings such as food establishments. We therefore propose a surveillance method that identifies COVID-19 cases among food establishment workers from news reports using web-scraping and natural language processing (NLP). First, we used web-scraping to identify a broad set of articles (n = 67,078) related to COVID-19 based on keyword mentions. From this dataset, we used an open-source NLP platform (ClarityNLP) to automatically extract location, industry, case counts, and death counts. These articles were vetted and validated by CDC subject matter experts (SMEs) to identify those describing COVID-19 outbreaks in food establishments. CDC and Georgia Tech Research Institute SMEs provided a human-labeled test dataset of 388 articles to validate our algorithms. Then, to improve quality, we fine-tuned a pre-trained RoBERTa instance, a bidirectional transformer language model, to classify articles containing ≥1 positive COVID-19 cases in food establishments. Applying RoBERTa reduced the number of articles from 67,078 to 1,112 and classified articles (as containing ≥1 positive COVID-19 case in food establishments) with 88% accuracy on the human-labeled test dataset. By automating the pipeline of web-scraping and COVID-19 case prediction using RoBERTa, we enable an efficient human-in-the-loop process in which COVID-19 data can be manually collected from articles flagged by our model, reducing human labor requirements. Furthermore, our approach could be used to predict and monitor locations of COVID-19 development by geography and could be extended to other industries and news article datasets of interest.
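The filter-then-classify flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the keyword list, the regular expressions, and the function names (`keyword_filter`, `toy_classifier`, `accuracy`, `reduction_fraction`) are all hypothetical, and `toy_classifier` is a rule-based stand-in for the fine-tuned RoBERTa model, which the abstract does not specify in detail. Only the article counts (67,078 and 1,112) come from the abstract.

```python
import re
from typing import List, Sequence

# Hypothetical keyword list; the abstract does not state the actual terms used.
COVID_KEYWORDS = ("covid-19", "coronavirus", "sars-cov-2")

# Hypothetical patterns used only by the stand-in classifier below.
FOOD_PATTERN = re.compile(r"\b(?:meatpacking|poultry plant|grocery|restaurant)\b")
CASE_PATTERN = re.compile(r"\b(?:tested positive|positive cases?|outbreak)\b")


def keyword_filter(articles: Sequence[str]) -> List[str]:
    """Stage 1: keep only articles that mention a COVID-19 keyword."""
    return [a for a in articles if any(k in a.lower() for k in COVID_KEYWORDS)]


def toy_classifier(article: str) -> bool:
    """Stage 2 stand-in (NOT RoBERTa): flag an article when it mentions
    both a food setting and a positive case or outbreak."""
    text = article.lower()
    return bool(FOOD_PATTERN.search(text)) and bool(CASE_PATTERN.search(text))


def accuracy(predictions: Sequence[bool], labels: Sequence[bool]) -> float:
    """Fraction of predictions that match the human-assigned labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def reduction_fraction(before: int, after: int) -> float:
    """Share of articles removed from the manual review queue."""
    return 1 - after / before


# Toy articles and labels, invented for illustration only.
ARTICLES = [
    "Twelve workers at a poultry plant tested positive for COVID-19.",
    "City council debates new parking rules.",
    "Coronavirus outbreak closes local school for two weeks.",
    "Grocery store chain expands COVID-19 delivery options.",
]
LABELS = [True, False, False]  # labels for the three articles that pass stage 1

candidates = keyword_filter(ARTICLES)
predictions = [toy_classifier(a) for a in candidates]
flagged_for_review = [a for a, p in zip(candidates, predictions) if p]
```

Under these assumptions, `reduction_fraction(67078, 1112)` shows that the classification step removes roughly 98% of the keyword-matched articles from the manual review queue, which is the labor saving the abstract refers to.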
