Journal: BioMed Research International

ChemTok: A New Rule Based Tokenizer for Chemical Named Entity Recognition

Abstract

Named Entity Recognition (NER) from text constitutes the first step in many text mining applications. The most important preliminary step for NER systems using machine learning approaches is tokenization, where raw text is segmented into tokens. This study proposes an enhanced rule-based tokenizer, ChemTok, which utilizes rules extracted mainly from the training data set. The main novelty of ChemTok is the use of the extracted rules to merge the tokens split in the previous steps, thus producing longer and more discriminative tokens. ChemTok is compared to the tokenization methods utilized by ChemSpot and tmChem. Support Vector Machines and Conditional Random Fields are employed as the learning algorithms. The experimental results show that the classifiers trained on the output of ChemTok outperform all classifiers trained on the output of the other two tokenizers in terms of both classification performance and the number of incorrectly segmented entities.
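
The split-then-merge idea described in the abstract can be illustrated with a minimal Python sketch. It is not ChemTok itself: the two merge rules below (hyphens between alphanumeric characters, commas between digits) are hypothetical stand-ins for the rules the paper extracts from its training data, and the names `fine_tokenize`, `glue`, and `merge_tokens` are invented for this illustration.

```python
import re
from typing import List, Tuple

Token = Tuple[str, int, int]  # (text, start offset, end offset)

# Step 1: deliberately fine-grained split into letter runs, digit runs,
# and single non-space characters.
FINE = re.compile(r"[A-Za-z]+|\d+|\S")


def fine_tokenize(text: str) -> List[Token]:
    return [(m.group(), m.start(), m.end()) for m in FINE.finditer(text)]


def glue(prev: str, cur: str, nxt: str) -> bool:
    """Hypothetical merge rules (assumptions, not ChemTok's extracted rules).

    prev: token built so far, cur: candidate fragment,
    nxt: fragment directly after cur ("" if separated by whitespace or absent).
    """
    if cur == "-":  # hyphen between two alphanumeric characters
        return prev[-1].isalnum() and nxt[:1].isalnum()
    if cur == ",":  # comma between two digits, as in the locants "1,2"
        return prev[-1].isdigit() and nxt[:1].isdigit()
    if len(prev) > 1 and prev[-1] == "-":  # continue after a merged hyphen
        return cur[0].isalnum()
    if len(prev) > 1 and prev[-1] == ",":  # continue after a merged comma
        return cur[0].isdigit()
    return False


def merge_tokens(tokens: List[Token]) -> List[Token]:
    # Step 2: re-join adjacent fragments (no whitespace in between)
    # whenever a rule fires, producing longer tokens.
    merged: List[Token] = []
    for i, (text, start, end) in enumerate(tokens):
        if merged and merged[-1][2] == start:
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            nxt_text = nxt[0] if nxt is not None and nxt[1] == end else ""
            if glue(merged[-1][0], text, nxt_text):
                merged[-1] = (merged[-1][0] + text, merged[-1][1], end)
                continue
        merged.append((text, start, end))
    return merged


if __name__ == "__main__":
    sample = "1,2-dichloroethane and 2014, then"
    print([t for t, _, _ in merge_tokens(fine_tokenize(sample))])
```

In this sketch the chemical name `1,2-dichloroethane` is first split into five fragments and then merged back into a single token, while the comma after `2014` stays separate because no digit follows it.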