BMC Bioinformatics

Incorporating rich background knowledge for gene named entity classification and recognition



Abstract

Background: Gene named entity classification and recognition are crucial preliminary steps of text mining in biomedical literature. Machine learning based methods have been used in this area with great success. In most state-of-the-art systems, elaborately designed lexical features, such as words, n-grams, and morphology patterns, have played a central part. However, this type of feature tends to cause extreme sparseness in the feature space. As a result, terms that are out-of-vocabulary (OOV) with respect to the training data are not modeled well, because little information is available about them.

Results: We propose a general framework for gene named entity representation, called feature coupling generalization (FCG). The basic idea is to generate higher-level features using term frequency and co-occurrence information of highly indicative features in a huge amount of unlabeled data. We examine its performance in a named entity classification task, which is designed to remove non-gene entries from a large dictionary derived from online resources. The results show that the new features generated by FCG outperform lexical features by 5.97 F-score points overall and by 10.85 points for OOV terms. In this framework, each extension also yields significant improvements, and the sparse lexical features can be transformed into a representation that is both lower dimensional and more informative. A forward maximum match method based on the refined dictionary produces an F-score of 86.2 on the BioCreative 2 GM test set. We then combined the dictionary with a conditional random field (CRF) based gene mention tagger, achieving an F-score of 89.05, which improves the performance of the CRF-based tagger by 4.46 points with little impact on the efficiency of the recognition system. A demo of the NER system is available at http://202.118.75.18:8080/bioner .
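For readers unfamiliar with forward maximum matching, the following is a minimal sketch of the general technique, not the authors' implementation. The function name `forward_maximum_match`, the `max_len` window, the lowercase normalization, and the toy dictionary are all assumptions made for illustration; the abstract does not specify tokenization or normalization details.

```python
def forward_maximum_match(tokens, dictionary, max_len=6):
    """Greedy left-to-right dictionary lookup (hypothetical helper).

    Returns a list of (start, end) token spans, end exclusive, where the
    longest dictionary entry starting at each position was found.
    """
    spans = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first, then shrink the window.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()  # assumed normalization
            if candidate in dictionary:
                spans.append((i, i + n))
                i += n  # skip past the matched mention
                matched = True
                break
        if not matched:
            i += 1  # no entry starts here; move one token forward
    return spans


# Toy dictionary for illustration only; not taken from the paper.
toy_dictionary = {"tumor necrosis factor", "tumor necrosis factor alpha", "p53"}
tokens = "Expression of tumor necrosis factor alpha and p53 was measured".split()
print(forward_maximum_match(tokens, toy_dictionary))
```

Because the longest candidate is tried first, "tumor necrosis factor alpha" is preferred over its prefix "tumor necrosis factor". The quality of such a matcher depends almost entirely on the dictionary, which is why the abstract emphasizes removing non-gene entries before matching.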
