Conference: Annual International Conference of the IEEE Engineering in Medicine and Biology Society

Identifying and extracting patient smoking status information from clinical narrative texts in Spanish




In this work we present a system that identifies and extracts a patient's smoking status from clinical narrative text in Spanish. The clinical narrative text was processed using natural language processing techniques and annotated by four people with a biomedical background. The dataset used for classification comprised 2,465 documents, each annotated with one of four smoking status categories. We used two feature representations: single-word tokens (unigrams) and bigrams. The classification problem was divided into two levels: the first distinguishes smokers (S) from non-smokers (NS); the second distinguishes current smokers (CS) from past smokers (PS). For each feature representation and classification level, we used two classifiers: Support Vector Machines (SVM) and Bayesian Networks (BN). We split the dataset into a training set containing 66% of the available documents, used to build the classifiers, and a test set containing the remaining 34%, used to evaluate the models. Our results show that SVM with the bigram representation performed best at both classification levels. For the S vs. NS level, the performance measures were accuracy (ACC) = 85%, precision = 85%, and recall = 90%. For the CS vs. PS level, they were ACC = 87%, precision = 91%, and recall = 94%.
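The preprocessing and evaluation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the helper names (`ngram_features`, `level1_label`, `split_train_test`, `binary_metrics`), the random seed, and the toy Spanish phrases are all assumptions made for illustration, and the SVM and BN classifiers themselves are left out. The abstract does not name all four annotation categories, so only the labels it mentions (CS, PS, NS) appear here.

```python
import random
from collections import Counter


def ngram_features(text, n):
    """Count word n-grams (n=1 for single tokens, n=2 for bigrams).

    Assumption: lowercasing plus whitespace tokenization; the paper's
    actual NLP preprocessing is not described in the abstract.
    """
    tokens = text.lower().split()
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)


def level1_label(status):
    """Map a fine-grained status to the first level: smoker vs non-smoker."""
    return "S" if status in ("CS", "PS") else "NS"


def split_train_test(docs, train_fraction=0.66, seed=0):
    """Shuffle and split documents into 66% training and 34% test."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def binary_metrics(gold, predicted, positive):
    """Return (accuracy, precision, recall) for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical toy documents, not taken from the paper's dataset.
toy_docs = [
    ("paciente fumador activo", "CS"),
    ("ex fumador desde hace diez años", "PS"),
    ("no fuma", "NS"),
]

for text, status in toy_docs:
    print(level1_label(status), dict(ngram_features(text, 2)))
```

In a full pipeline, the level-two classifier (CS vs. PS) would presumably be trained only on documents labelled as smokers at level one, and `binary_metrics` would be applied to each level's test-set predictions separately.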


