Journal: Computer Engineering and Design (《计算机工程与设计》)

Text Structuring in the Medical Domain

             

Abstract

General-purpose word segmentation tools recognize medical terminology poorly, which degrades the accuracy of medical text structuring. To address this problem, a new-word discovery method based on word embeddings is proposed, and the lexicon built during new-word discovery is used to extract information and produce structured data. Google's open-source word vector tool word2vec is used to train on the text and map words into an abstract n-dimensional vector space. New words are discovered using the association score between words, the left and right information entropy of each word, and word frequency in the text; the discovered words are added to a user-defined lexicon. Information extraction rules are then designed so that key information is extracted according to the discovered keywords, and the keywords and key information are organized into structured data. Experimental results on real medical data show that the proposed method improves accuracy by 10% and efficiency by 18% compared with the traditional method.
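The new-word discovery step named in the abstract (word frequency, left and right information entropy, and the strength of association between words) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract derives the association score from word2vec vectors, whereas the sketch below substitutes pointwise mutual information (PMI) so that it runs with the standard library only. The function names, the thresholds (`min_freq`, `min_pmi`, `min_entropy`), and the toy corpus are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (natural log) of a Counter of neighbouring tokens."""
    total = sum(counter.values())
    if total == 0:
        return 0.0  # no observed neighbours, e.g. always at a sentence boundary
    return -sum((c / total) * math.log(c / total) for c in counter.values())


def score_candidates(sentences, min_freq=2):
    """Score adjacent token pairs as candidate new words.

    `sentences` is a list of token lists, as produced by a general-purpose
    segmenter. PMI stands in for the association score (an assumption; the
    abstract obtains this score from word2vec).
    """
    unigrams, bigrams = Counter(), Counter()
    left_ctx, right_ctx = defaultdict(Counter), defaultdict(Counter)
    for tokens in sentences:
        unigrams.update(tokens)
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            bigrams[pair] += 1
            if i > 0:
                left_ctx[pair][tokens[i - 1]] += 1
            if i + 2 < len(tokens):
                right_ctx[pair][tokens[i + 2]] += 1

    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    scores = {}
    for (a, b), freq in bigrams.items():
        if freq < min_freq:
            continue
        p_pair = freq / n_bi
        p_a, p_b = unigrams[a] / n_uni, unigrams[b] / n_uni
        scores[a + b] = {
            "freq": freq,
            "pmi": math.log(p_pair / (p_a * p_b)),
            "left_entropy": entropy(left_ctx[(a, b)]),
            "right_entropy": entropy(right_ctx[(a, b)]),
        }
    return scores


def discover_new_words(sentences, min_freq=2, min_pmi=1.0, min_entropy=0.5):
    """Keep candidates that are frequent, cohesive, and free on both sides."""
    scores = score_candidates(sentences, min_freq)
    return sorted(
        word
        for word, s in scores.items()
        if s["pmi"] >= min_pmi
        and min(s["left_entropy"], s["right_entropy"]) >= min_entropy
    )


# Toy corpus (hypothetical): a segmenter has split the term into two tokens.
sentences = [
    ["患者", "有", "冠心", "病", "史"],
    ["诊断", "为", "冠心", "病", "。"],
    ["冠心", "病", "需", "复查"],
    ["患者", "有", "高血压", "史"],
]
new_words = discover_new_words(sentences)
print(new_words)
```

On this toy corpus the pair "冠心" + "病" is frequent, strongly associated, and appears with varied neighbours on both sides, so it is the only candidate kept; "患者" + "有" is rejected because it only occurs at the start of a sentence and therefore has zero left entropy. In a pipeline like the one the abstract describes, the surviving words would be added to the segmenter's user-defined lexicon before the text is segmented again.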
