Conference proceedings: Information Retrieval Technology

Enhancing Biomedical Named Entity Classification Using Terabyte Unlabeled Data

Abstract

This paper presents a semi-supervised learning method that enhances biomedical named entity classification using Feature Coupling Degree (FCD) features, which are generated from labeled data and terabyte-scale unlabeled data. Highly discriminative context words are selected from labeled free text using the chi-square method; queries formed by combining a named entity with each context word are then submitted to a search engine. The returned web page counts are converted into binary features by discretization. We investigate the effect of this type of feature on a biomedical corpus generated from several online resources. A Support Vector Machine (SVM) is used as the classifier, and the performance of different features is compared across various kernels and discretization methods. The results show that the method improves classification performance, especially for Out-of-Vocabulary (OOV) terms and relatively small training sets. In addition, FCD features alone, used with polynomial kernels, achieve performance competitive with classical features.
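
The abstract does not give implementation details, so the following is only a minimal sketch of the three steps it names: chi-square scoring of context words, query formation, and discretization of page counts into binary features. Everything specific here is an assumption made for illustration: the function names, the quoted-phrase query format, the one-hot binning scheme, the thresholds `(1, 100, 10000)`, the toy contexts, and the hard-coded page count that stands in for a real search engine call.

```python
from bisect import bisect_right
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 contingency table.

    n11: target-class contexts containing the word
    n10: other-class contexts containing the word
    n01: target-class contexts without the word
    n00: other-class contexts without the word
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0  # degenerate table: no evidence of association
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def context_word_scores(contexts, labels, target):
    """Score every word by its chi-square association with `target`."""
    in_target, in_other = Counter(), Counter()
    n_target = sum(1 for label in labels if label == target)
    n_other = len(labels) - n_target
    for words, label in zip(contexts, labels):
        bucket = in_target if label == target else in_other
        bucket.update(set(words))  # count each word once per context
    scores = {}
    for word in set(in_target) | set(in_other):
        n11, n10 = in_target[word], in_other[word]
        scores[word] = chi_square(n11, n10, n_target - n11, n_other - n10)
    return scores


def build_query(entity, context_word):
    """Combine an entity and a context word (assumed quoted-phrase format)."""
    return f'"{entity}" "{context_word}"'


def discretize(count, thresholds):
    """Map a page count to a one-hot binary vector over sorted thresholds."""
    bin_index = bisect_right(thresholds, count)
    return [1 if i == bin_index else 0 for i in range(len(thresholds) + 1)]


# Toy labeled contexts (hypothetical data, not from the paper's corpus).
contexts = [
    ["binds", "protein"],
    ["protein", "expression"],
    ["patients", "disease"],
    ["disease", "treatment"],
]
labels = ["protein", "protein", "disease", "disease"]
scores = context_word_scores(contexts, labels, "protein")

# A real system would obtain this count from a search engine; the value
# below is a made-up placeholder.
hypothetical_page_counts = {build_query("p53", "protein"): 50}
query = build_query("p53", "protein")
features = discretize(hypothetical_page_counts[query], (1, 100, 10000))
```

Note that the chi-square statistic is two-sided, so a word that is strongly associated with a different class also scores highly for the target class. Binary vectors like `features`, concatenated over the selected context words, are the kind of input that could then be passed to an SVM classifier.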
