Information Processing & Management

A tale of two epidemics: Contextual Word2Vec for classifying twitter streams during outbreaks



Abstract

Unstructured tweet feeds are becoming a source of real-time information for various events. However, extracting actionable information in real time from this unstructured text data is a challenging task. Hence, researchers are employing word embedding approaches to classify unstructured text data. We set our study in the contexts of the 2014 Ebola and 2016 Zika outbreaks and probed the accuracy of domain-specific word vectors for identifying crisis-related actionable tweets. Our findings suggest that relatively small domain-specific input corpora drawn from Twitter are better at extracting meaningful semantic relationships than generic pre-trained Word2Vec (trained on Google News) or GloVe (from the Stanford NLP group) vectors. However, high-quality domain-specific tweet corpora are normally scant during the early stages of an outbreak, and identifying actionable tweets at that stage is crucial to stemming the outbreak's proliferation. To overcome this challenge, we consider scholarly abstracts related to the Ebola and Zika viruses from PubMed and probe the efficiency of cross-domain resource utilization for word vector generation. Our findings demonstrate the relevance of PubMed abstracts for training when Twitter data (as the input corpus) is scant during the early stages of an outbreak. Thus, this approach can be implemented to handle future outbreaks in real time. We also explore the accuracy of our word vectors across various model architectures and hyper-parameter settings. We observe that Skip-gram accuracies are better than those of CBOW, and that higher vector dimensions yield better accuracy.
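The Skip-gram vs. CBOW comparison in the abstract comes down to the training signal each architecture derives from a window of text: Skip-gram predicts each context word from the center word, while CBOW predicts the center word from its surrounding context. A minimal sketch of the pair generation (the toy tweet and window size are invented for illustration; in practice one would train with a library such as gensim, where `sg=1` selects Skip-gram and `sg=0` CBOW):

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the Skip-gram training signal."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """(context_list, center) pairs: the CBOW training signal."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            pairs.append((context, center))
    return pairs

# Hypothetical tweet for illustration only.
tweet = "ebola outbreak reported in guinea".split()
print(skipgram_pairs(tweet)[:2])  # [('ebola', 'outbreak'), ('ebola', 'reported')]
print(cbow_pairs(tweet)[0])       # (['outbreak', 'reported'], 'ebola')
```

Because Skip-gram emits one training example per (center, context) pairing, rare domain terms (e.g. disease names in early-outbreak tweets) receive more individual updates than under CBOW's averaged context, which is consistent with the paper's observation that Skip-gram accuracies are higher.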
