Journal: Big Data

A Supervised Learning Process to Validate Online Disease Reports for Use in Predictive Models

Abstract

Pathogen distribution models that predict spatial variation in disease occurrence require data from a large number of geographic locations to generate disease risk maps. Traditionally, this process has used data from public health reporting systems; however, using online reports of new infections could speed up the process dramatically. Data from both public health systems and online sources must be validated before they can be used, but no mechanisms exist to validate data from online media reports. We have developed a supervised learning process to validate geolocated disease outbreak data in a timely manner. The process uses three input features: the data source and two metrics derived from the location of each disease occurrence. These two metrics are the probability of disease occurrence at that location, based on environmental and socioeconomic factors, and the distance of the location within or outside the current known disease extent. The process also uses validation scores, generated by disease experts who review a subset of the data, to build a training data set. The aim of the supervised learning process is to generate validation scores that can be used as weights going into the pathogen distribution model. After analyzing the three input features and testing the performance of alternative processes, we selected a cascade of ensembles comprising logistic regressors. Parameter values for the training data subset size, number of predictors, and number of layers in the cascade were tested before the process was deployed. The final configuration was tested using data for two contrasting diseases (dengue and cholera), and 66%–79% of data points were assigned a validation score. The remaining data points are scored by the experts; these scores inform the training data set for the next set of predictors and also go to the pathogen distribution model. The new supervised learning process has been implemented within our live site and is being used to validate the data that our system uses to produce updated predictive disease maps on a weekly basis.
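
The abstract does not spell out how the cascade decides which data points receive a score, so the following is only a minimal sketch of one way a cascade of ensembles of logistic regressors could be arranged, not the authors' implementation. It assumes that each layer is an ensemble trained on random subsets of the expert-scored data, that a layer assigns a score only when its members agree to within a hypothetical `max_spread` threshold, and that points no layer can score are routed to the experts. The three features, the toy data, and all function and parameter names are illustrative assumptions.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, x):
    # w holds one weight per feature followed by the bias term.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])

def fit_logistic(X, y, lr=0.5, epochs=300):
    # Batch gradient descent on the log loss; y may hold scores in [0, 1].
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def train_ensemble(X, y, n_predictors, subset_size, rng):
    # Each predictor sees its own random subset (drawn with replacement).
    ensemble = []
    for _ in range(n_predictors):
        idx = [rng.randrange(len(X)) for _ in range(subset_size)]
        ensemble.append(fit_logistic([X[i] for i in idx], [y[i] for i in idx]))
    return ensemble

def cascade_score(layers, x, max_spread=0.2):
    # Assumed rule: the first layer whose predictors agree assigns the score.
    for k, ensemble in enumerate(layers):
        preds = [predict(w, x) for w in ensemble]
        if max(preds) - min(preds) <= max_spread:
            return sum(preds) / len(preds), k
    return None, None  # no layer was confident: send the point to the experts

# Synthetic toy data, invented purely for illustration.
# Features: [source reliability flag, occurrence probability, distance from extent]
rng = random.Random(0)
X, y = [], []
for _ in range(200):
    source = float(rng.randint(0, 1))
    suitability = rng.random()
    distance = rng.uniform(-1.0, 1.0)  # negative means inside the known extent
    X.append([source, suitability, distance])
    valid = suitability - 0.5 * max(distance, 0.0) + 0.2 * source > 0.5
    y.append(1.0 if valid else 0.0)

layers = [train_ensemble(X, y, n_predictors=5, subset_size=50, rng=rng)
          for _ in range(3)]
results = [cascade_score(layers, x) for x in X]
assigned = sum(1 for score, _ in results if score is not None)
print(f"{assigned} of {len(X)} toy points scored automatically; the rest go to experts")
```

In this sketch, `subset_size`, `n_predictors`, and the number of entries in `layers` stand in for the three parameters the abstract says were tuned before deployment. The share of points scored automatically depends entirely on the toy data and the assumed agreement threshold, so it should not be compared with the 66%–79% reported for dengue and cholera.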
