Source journal: Quality Control, Transactions

CREGEX: A Biomedical Text Classifier Based on Automatically Generated Regular Expressions

Abstract

High-accuracy text classifiers are now used to organize large amounts of biomedical information and to support clinical decision-making. In medical informatics, regular-expression-based classifiers have emerged as an alternative to traditional discriminative classification algorithms because they can model sequential patterns. This article presents CREGEX (Classifier Regular Expression), a biomedical text classifier built on an automatically generated feature space of regular expressions. We conceived an algorithm that automatically constructs an informative and discriminative regular-expression-based feature space, suitable for binary and multiclass discrimination problems. Regular expressions are generated automatically from training texts using a coarse-to-fine text alignment method, which trades off coverage of the lexical variants of words (in terms of gender and grammatical number) against the generation of a feature space containing a large number of noisy features. CREGEX carries out feature selection by filtering keywords and also computes a confidence metric to classify test texts. Three de-identified datasets in Spanish, with information on smoking habits, obesity, and obesity types, were used to assess the performance of CREGEX. For comparison, Support Vector Machine (SVM) and Naïve Bayes (NB) supervised classifiers were also trained with consecutive sequences of tokens (n-grams) as features. Results show that, on all the datasets used for evaluation, CREGEX not only outperformed both the SVM and NB classifiers in terms of accuracy and F-measure (p-value < 0.05) but also required fewer training examples to achieve the same performance. This superior performance is attributed to the regular expressions' ability to represent complex text patterns.
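To make the idea of a regular-expression feature space concrete, the following is a minimal Python sketch, not the CREGEX algorithm itself. The patterns are written by hand here (CREGEX generates them automatically from training texts), and the class labels, the `PATTERNS` table, the `featurize` and `classify` helpers, and the vote-share confidence score are all hypothetical choices made for illustration; the abstract does not define the actual confidence metric.

```python
import re
from collections import Counter

# Hypothetical, hand-written patterns for a smoking-habit task in Spanish.
# CREGEX derives its patterns automatically; these are illustration only.
PATTERNS = {
    "smoker": [
        r"\bfumador(?:a|es|as)?\b",          # covers gender/number variants
        r"\bfuma\s+\d+\s+cigarrillos?\b",
    ],
    "non_smoker": [
        r"\bno\s+fuma\b",
        r"\bniega\s+tabaquismo\b",
    ],
}

# Flatten to an ordered list of (label, compiled regex) pairs.
COMPILED = [
    (label, re.compile(pattern, re.IGNORECASE))
    for label, patterns in PATTERNS.items()
    for pattern in patterns
]


def featurize(text, compiled=COMPILED):
    """Binary feature vector: 1 if the pattern matches anywhere in the text."""
    return [1 if rx.search(text) else 0 for _, rx in compiled]


def classify(text, compiled=COMPILED):
    """Return (label, confidence), where confidence is the share of matching
    patterns that vote for the winning label (an assumed, simplistic metric)."""
    votes = Counter(label for label, rx in compiled if rx.search(text))
    total = sum(votes.values())
    if total == 0:
        return None, 0.0  # no pattern matched; abstain
    label, count = votes.most_common(1)[0]
    return label, count / total


if __name__ == "__main__":
    for note in ["Paciente fumador de 10 cigarrillos al día",
                 "El paciente no fuma",
                 "Sin antecedentes"]:
        print(note, "->", featurize(note), classify(note))
```

The optional group `(?:a|es|as)?` shows how one pattern can absorb gender and number variants of a word, which is the kind of lexical variation the abstract says the alignment method has to balance against feature-space noise.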
