Journal: BMC Bioinformatics

Building a biomedical tokenizer using the token lattice design pattern and the adapted Viterbi algorithm



Abstract

Background: Tokenization is an important component of language processing, yet there is no widely accepted tokenization method for English texts, including biomedical texts. Apart from rule-based techniques, tokenization in the biomedical domain has been regarded as a classification task. Biomedical classifier-based tokenizers either split or join textual objects through classification to form tokens. The idiosyncratic nature of each biomedical tokenizer’s output complicates adoption and reuse. Furthermore, biomedical tokenizers generally lack guidance on how to apply an existing tokenizer to a new domain (subdomain). We identify and complete a novel tokenizer design pattern and suggest a systematic approach to tokenizer creation. We implement a tokenizer based on our design pattern that combines regular expressions and machine learning. Our machine learning approach differs from the previous split-join classification approaches. We evaluate our approach against three other tokenizers on the task of tokenizing biomedical text.

Results: Medpost and our adapted Viterbi tokenizer performed best, with accuracies of 92.9% and 92.4% respectively.

Conclusions: Our evaluation of our design pattern and guidelines supports our claim that they are a viable approach to tokenizer construction (producing tokenizers that match leading custom-built tokenizers in a particular domain). Our evaluation also demonstrates that ambiguous tokenizations can be disambiguated through POS tagging. When tokenizations are disambiguated this way, POS tag sequences and training data have a significant impact on proper text tokenization.
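The abstract does not spell out the token lattice or the adapted Viterbi algorithm, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a whitespace-free chunk of text, a regular expression that proposes fine-grained pieces alongside the whole chunk as competing candidates, and a toy POS model. The function names (`build_lattice`, `best_path`, `tokenize`), the tag set, and every probability in `EMIT` and `TRANS` are hypothetical and chosen purely for illustration.

```python
import math
import re
from collections import defaultdict

# Hypothetical toy model: tags and probabilities are invented for illustration.
EMIT = {  # P(token | tag)
    "IL-2": {"NN": 0.01}, "IL": {"NN": 0.001}, "-": {"SYM": 0.5},
    "2": {"CD": 0.1}, "5": {"CD": 0.1}, "mg": {"NN": 0.05},
}
TRANS = {  # P(tag | previous tag); "<s>" marks the start of the chunk
    ("<s>", "NN"): 0.5, ("<s>", "CD"): 0.3,
    ("NN", "SYM"): 0.1, ("SYM", "CD"): 0.3, ("CD", "NN"): 0.6,
}

def build_lattice(chunk):
    """Map each start offset to its candidate (end offset, token) edges."""
    edges = defaultdict(list)
    # Fine-grained candidates: runs of letters, runs of digits, single symbols.
    for m in re.finditer(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]", chunk):
        edges[m.start()].append((m.end(), m.group()))
    # Coarse candidate: the whole chunk as one token, if not already present.
    if chunk and (len(chunk), chunk) not in edges[0]:
        edges[0].append((len(chunk), chunk))
    return edges

def best_path(chunk, edges, emit, trans, floor=1e-6):
    """Viterbi search over the lattice; returns the best (token, tag) sequence."""
    n = len(chunk)
    # chart[pos][tag] = (log-probability, back-pointer) of the best path
    # that ends at character offset pos with the given tag.
    chart = defaultdict(dict)
    chart[0]["<s>"] = (0.0, None)
    for pos in range(n):
        for prev_tag, (score, _) in list(chart[pos].items()):
            for end, token in edges.get(pos, []):
                # Unknown tokens fall back to a floor-probability noun.
                for tag, p_emit in emit.get(token, {"NN": floor}).items():
                    p_trans = trans.get((prev_tag, tag), floor)
                    cand = score + math.log(p_trans) + math.log(p_emit)
                    if tag not in chart[end] or cand > chart[end][tag][0]:
                        chart[end][tag] = (cand, (pos, prev_tag, token))
    if not chart[n]:
        return []  # no path covers the whole chunk
    tag = max(chart[n], key=lambda t: chart[n][t][0])
    pos, path = n, []
    while chart[pos][tag][1] is not None:
        prev_pos, prev_tag, token = chart[pos][tag][1]
        path.append((token, tag))
        pos, tag = prev_pos, prev_tag
    return path[::-1]

def tokenize(chunk):
    return best_path(chunk, build_lattice(chunk), EMIT, TRANS)

print(tokenize("IL-2"))  # [('IL-2', 'NN')]
print(tokenize("5mg"))   # [('5', 'CD'), ('mg', 'NN')]
```

Under these invented probabilities, "IL-2" stays a single token because the one-token path scores higher than the three-token split, while "5mg" is split because the whole chunk is unknown to the toy model. Changing the tag-sequence or emission probabilities changes which path wins, which is one way to read the abstract's remark that POS tag sequences and training data strongly affect tokenization.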
