Ninth International Conference on Machine Learning and Applications

Building a Biomedical Tokenizer Using the Token Lattice Design Pattern and the Adapted Viterbi Algorithm

Abstract

Proper tokenization of biomedical text is a non-trivial problem. Current biomedical tokenizers suffer from idiosyncratic output and from poor extensibility and reuse. To address these problems, we identified and completed a novel design pattern for biomedical tokenizers. The pattern separates a tokenizer into three components: a token lattice with its lattice constructor, a best lattice-path chooser, and token transducers. Token transducers create tokens from text, the lattice constructor assembles these tokens into a token lattice, and the best path through the lattice is selected as the tokenization of the text. We applied the design pattern and our token transducer identification guidelines to create a tokenizer for SNOMED CT concept descriptions and compared it with three other tokenization methods. MedPost and our adapted Viterbi tokenizer performed best, with accuracies of 90.1% and 93.7%, respectively.
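The three components named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the regular expressions, the fixed per-token scores, and all names (`Token`, `regex_transducer`, `build_lattice`, `best_path`, `tokenize`) are assumptions made for illustration, and the path chooser is a plain Viterbi-style dynamic program over the lattice rather than the adapted Viterbi algorithm of the paper, whose scoring model is not described in the abstract.

```python
import re
from typing import Callable, Dict, List, NamedTuple, Optional

class Token(NamedTuple):
    start: int    # inclusive character offset
    end: int      # exclusive character offset
    text: str
    score: float  # hypothetical additive score; higher is better

Transducer = Callable[[str], List[Token]]

def regex_transducer(pattern: str, score: float) -> Transducer:
    """Build a token transducer that proposes every non-empty regex match."""
    compiled = re.compile(pattern)
    def run(text: str) -> List[Token]:
        return [Token(m.start(), m.end(), m.group(), score)
                for m in compiled.finditer(text) if m.end() > m.start()]
    return run

# Illustrative transducers and scores (assumptions, not taken from the paper).
TRANSDUCERS: List[Transducer] = [
    regex_transducer(r"\s+", 0.0),                               # whitespace, dropped later
    regex_transducer(r"[A-Za-z]+", -1.0),                        # plain words
    regex_transducer(r"\d+(?:\.\d+)?", -1.0),                    # integers and decimals
    regex_transducer(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+", -1.5),   # hyphenated terms
    regex_transducer(r"\S", -1.0),                               # single-character fallback
]

def build_lattice(text: str, transducers: List[Transducer]) -> Dict[int, List[Token]]:
    """Lattice constructor: index all candidate tokens by start offset."""
    lattice: Dict[int, List[Token]] = {}
    for transducer in transducers:
        for tok in transducer(text):
            lattice.setdefault(tok.start, []).append(tok)
    return lattice

def best_path(text: str, lattice: Dict[int, List[Token]]) -> List[Token]:
    """Best-path chooser: highest-scoring token sequence covering the text."""
    n = len(text)
    best_score = [float("-inf")] * (n + 1)
    back: List[Optional[Token]] = [None] * (n + 1)
    best_score[0] = 0.0
    for pos in range(n):  # offsets are already in topological order
        if best_score[pos] == float("-inf"):
            continue
        for tok in lattice.get(pos, []):
            candidate = best_score[pos] + tok.score
            if candidate > best_score[tok.end]:
                best_score[tok.end] = candidate
                back[tok.end] = tok
    if n > 0 and back[n] is None:
        raise ValueError("no path through the lattice covers the text")
    path: List[Token] = []
    pos = n
    while pos > 0:  # follow back-pointers from the end of the text
        tok = back[pos]
        path.append(tok)
        pos = tok.start
    path.reverse()
    return path

def tokenize(text: str) -> List[str]:
    lattice = build_lattice(text, TRANSDUCERS)
    return [t.text for t in best_path(text, lattice) if not t.text.isspace()]

print(tokenize("IL-2 receptor, 2.5 mg"))
```

Under these assumed scores, a hyphenated term such as "IL-2" is kept as one token because a single token scoring -1.5 beats the three-token alternative scoring -3.0. Adding support for a new token type amounts to appending another transducer to the list, which is the kind of extensibility the design pattern aims for.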