AMIA Annual Symposium Proceedings

Optimizing Corpus Creation for Training Word Embedding in Low Resource Domains: A Case Study in Autism Spectrum Disorder (ASD)

Abstract

Automating the extraction of behavioral criteria indicative of Autism Spectrum Disorder (ASD) from electronic health records (EHRs) can contribute significantly to efforts to monitor the condition. Word embedding algorithms such as Word2Vec encode the semantic meanings of words as vectors and can assist in automated vocabulary discovery from EHRs. However, the text available for training word embeddings for ASD is minuscule compared to the billions of tokens typically used. We evaluate the importance of corpus specificity versus size and hypothesize that, for specific domains, small corpora can generate excellent word embeddings. We custom-built 6 ASD-themed corpora (N=4482), using ASD EHRs and abstracts from PubMed (N=39K) and PsychInfo (N=69K), and evaluated them. The most useful 200-dimension embeddings were generated from the small ASD EHR data. Because of the diversity of their vocabulary, the abstract-based embeddings generated fewer related terms and saw minimal improvement as corpus size increased.
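The "related terms" an embedding generates are commonly found by ranking vocabulary items by cosine similarity to a seed term. The sketch below illustrates only that lookup step, not Word2Vec training and not the authors' evaluation procedure. The function names `cosine` and `related_terms`, the three-dimensional vectors, and the example words are all hypothetical and chosen for illustration; the embeddings in the abstract have 200 dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as unrelated
    return dot / (norm_u * norm_v)


def related_terms(seed, embeddings, top_n=3):
    """Rank every other vocabulary term by cosine similarity to the seed term."""
    seed_vec = embeddings[seed]
    scored = [
        (term, cosine(seed_vec, vec))
        for term, vec in embeddings.items()
        if term != seed
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors invented for illustration only; they are not taken from the paper.
TOY_EMBEDDINGS = {
    "rocking": [0.9, 0.1, 0.0],
    "flapping": [0.8, 0.2, 0.1],
    "spinning": [0.7, 0.3, 0.0],
    "insurance": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    for term, score in related_terms("rocking", TOY_EMBEDDINGS):
        print(f"{term}\t{score:.3f}")
```

With trained vectors, the same ranking could be run for each seed behavior term, and the top-ranked neighbors reviewed by hand as candidate vocabulary.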
