Journal: Applied Sciences

Learning Word Embeddings with Chi-Square Weights for Healthcare Tweet Classification



Abstract

Twitter is a popular source for monitoring healthcare information and public disease. However, tweets contain a great deal of noise: even when appropriate keywords appear in a tweet, they do not guarantee that the tweet is truly health-related. Traditional keyword-based classification is therefore largely ineffective. Word embedding algorithms have proved useful in many natural language processing (NLP) tasks. We introduce two algorithms based on an existing word embedding learning algorithm, the continuous bag-of-words (CBOW) model, and apply them to the task of recognizing healthcare-related tweets. In the CBOW model, the vector representation of a word is learned from its contexts. To simplify the computation, the context is represented by the average of all words inside the context window. However, not all words in the context window contribute equally to the prediction of the target word. Greedily incorporating every word in the window limits the contribution of the semantically useful words and brings noisy or irrelevant words into the learning process. Some existing word embedding algorithms also learn a weighted CBOW model, but their weights are based on pre-defined syntactic rules and ignore the task for which the embeddings are learned. We propose learning weights based on the words’ relative importance in the classification task. Our intuition is that such weights place more emphasis on words that contribute comparatively more to the later task. We evaluate the embeddings learned by our algorithms on two healthcare-related datasets. The experimental results demonstrate that embeddings learned by the proposed algorithms outperform existing techniques, with a relative accuracy improvement of over 9%.
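The contrast between the plain CBOW average and a task-weighted context can be illustrated with a minimal sketch. The abstract does not give the paper's exact weighting formula or training procedure, so everything below is an assumption made for illustration: the toy tweets, the random embeddings, the helper names `chi_square_scores` and `context_vector`, and the choice to use the standard 2x2 chi-square statistic of each word against a binary health label as the context weight.

```python
import numpy as np

def chi_square_scores(docs, labels):
    """Chi-square statistic of each word against a binary label (2x2 table).

    Illustrative assumption: document-level presence counts, not the
    paper's own weighting scheme.
    """
    vocab = sorted({w for doc in docs for w in doc})
    n = len(docs)
    n_pos = sum(labels)
    scores = {}
    for w in vocab:
        a = sum(1 for d, y in zip(docs, labels) if y == 1 and w in d)  # positive docs with w
        b = sum(1 for d, y in zip(docs, labels) if y == 0 and w in d)  # negative docs with w
        c = n_pos - a            # positive docs without w
        d_ = (n - n_pos) - b     # negative docs without w
        denom = (a + b) * (c + d_) * (a + c) * (b + d_)
        scores[w] = n * (a * d_ - b * c) ** 2 / denom if denom else 0.0
    return scores

def context_vector(context, embeddings, weights=None):
    """Plain average of context embeddings, or a weighted average if weights are given."""
    vecs = np.stack([embeddings[w] for w in context])
    if weights is None:
        return vecs.mean(axis=0)
    w = np.array([weights[t] for t in context], dtype=float)
    if w.sum() == 0:
        return vecs.mean(axis=0)  # fall back to the plain CBOW average
    return (w[:, None] * vecs).sum(axis=0) / w.sum()

# Hypothetical toy corpus: label 1 = health-related, 0 = not health-related.
docs = [
    ["flu", "shot", "today"],
    ["flu", "fever", "today"],
    ["bieber", "fever", "today"],
    ["concert", "today"],
]
labels = [1, 1, 0, 0]

scores = chi_square_scores(docs, labels)

# Random vectors stand in for embeddings that would normally be trained.
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=4) for w in scores}

context = ["flu", "today"]
plain = context_vector(context, embeddings)
weighted = context_vector(context, embeddings, scores)
```

In this toy corpus, "today" appears in every tweet and "fever" is split evenly between the two classes, so both receive a score of zero, while "flu" receives the largest score. The weighted context for `["flu", "today"]` therefore collapses onto the "flu" vector, whereas the plain average gives the uninformative word equal influence.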
