Source: 《小型微型计算机系统》 (Chinese-language journal)

A Structured Information Extraction Method for Medical Text Data

         

Abstract

Texts are an important information carrier in the medical field and provide essential data support for clinical diagnosis and pathological research. However, text written in natural language is usually unstructured, which makes it hard for machines to understand and process automatically. Chinese medical text is especially challenging: it is highly specialized and requires extensive domain knowledge, and it is typically written in short sentences, which makes structured information extraction more difficult. This paper therefore proposes a method for extracting structured information from medical text data. The method first uses text clustering and keyword extraction to obtain the expression terms commonly used in medical descriptions, and then uses the resulting medical term database to assist Chinese word segmentation and improve segmentation quality on Chinese medical texts. Next, it analyzes the semantic dependencies between words and builds dependency syntax trees. Finally, it identifies and extracts the key indicators and their corresponding values from these trees, producing structured key-value data. Experiments on real medical imaging report texts show that the method effectively improves Chinese word segmentation quality, with accuracy of up to 98.24%. It also performs well on structured information extraction, reaching at best an accuracy of 83.76% and a recall of 88.09%. In addition, the method covers a variety of dependency grammar patterns and therefore has good applicability.
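The pipeline described above can be illustrated with a minimal sketch. It is not the paper's implementation: forward maximum matching stands in for dictionary-assisted segmentation, the dependency arcs are written by hand rather than produced by a parser, and the function names, the `SBV`/`ATT` relation labels, and the example sentence are all assumptions chosen for illustration.

```python
from typing import Dict, List, Set, Tuple

# (head index, dependent index, relation label); labels are assumed, not from the paper
Arc = Tuple[int, int, str]


def segment(text: str, terms: Set[str]) -> List[str]:
    """Forward maximum matching: take the longest dictionary term at each position,
    falling back to a single character when no term matches."""
    max_len = max((len(t) for t in terms), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in terms:
                tokens.append(piece)
                i += size
                break
    return tokens


def extract_pairs(tokens: List[str], arcs: List[Arc]) -> Dict[str, str]:
    """Turn subject-predicate arcs into indicator/value pairs.
    The indicator is the subject plus its attribute modifiers; the value is the head."""
    pairs: Dict[str, str] = {}
    for head, dep, rel in arcs:
        if rel != "SBV":  # assumed label for a subject-predicate relation
            continue
        modifiers = [
            tokens[d]
            for h, d, r in sorted(arcs, key=lambda a: a[1])
            if h == dep and r == "ATT"  # assumed label for an attribute modifier
        ]
        pairs["".join(modifiers + [tokens[dep]])] = tokens[head]
    return pairs


if __name__ == "__main__":
    # Hypothetical term dictionary and report sentence ("liver size normal")
    term_db = {"肝脏", "大小", "正常"}
    toks = segment("肝脏大小正常", term_db)
    # Hand-written arcs: 大小 is the subject of 正常, 肝脏 modifies 大小
    toy_arcs = [(2, 1, "SBV"), (1, 0, "ATT")]
    print(toks)
    print(extract_pairs(toks, toy_arcs))
```

With the toy dictionary, the sentence is split into three terms and the extraction yields the single pair `肝脏大小` → `正常`. A real system would replace the hand-written arcs with the output of a dependency parser and would need rules for more relation types than the two shown here.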
