CREGEX: A Biomedical Text Classifier Based on Automatically Generated Regular Expressions

Flores Christopher A.; Figueroa Rosa L.; Pezoa Jorge E.; Zeng-Treitler Qing

Abstract

High-accuracy text classifiers are now used to organize large amounts of biomedical information and to support clinical decision-making processes. In medical informatics, regular expression-based classifiers have emerged as an alternative to traditional, discriminative classification algorithms due to their ability to model sequential patterns. This article presents CREGEX (Classifier Regular Expression), a biomedical text classifier based on an automatically generated feature space of regular expressions. We conceived an algorithm for automatically constructing an informative and discriminative regular-expression-based feature space, suitable for binary and multiclass discrimination problems. Regular expressions are automatically generated from training texts using a coarse-to-fine text alignment method, which trades off coverage of the lexical variants of words, in terms of gender and grammatical number, against the generation of a feature space containing a large number of noisy features. CREGEX carries out feature selection by filtering keywords and also computes a confidence metric to classify test texts. Three de-identified datasets in Spanish, with information on smoking habits, obesity, and obesity types, were used to assess the performance of CREGEX. For comparison, Support Vector Machine (SVM) and Naïve Bayes (NB) supervised classifiers were also trained with consecutive sequences of tokens (n-grams) as features. Results show that, in all the datasets used for evaluation, CREGEX not only outperformed both the SVM and NB classifiers in terms of accuracy and F-measure (p-value < 0.05) but also needed fewer training examples to achieve the same performance. This superior performance is attributed to the ability of regular expressions to represent complex text patterns.
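
The abstract does not describe the alignment procedure in detail, so the following is only a rough sketch of the general idea of deriving a regular expression from two aligned training texts and using it as a binary feature. It is not the CREGEX algorithm: Python's `difflib.SequenceMatcher` stands in for the coarse-to-fine alignment, and the names `regex_from_pair`, `regex_features`, the `max_gap` parameter, and the sample sentences are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def regex_from_pair(text_a: str, text_b: str, max_gap: int = 3) -> str:
    """Build a regex from the tokens two texts share, in order.

    Shared tokens are kept literally; every stretch where the texts
    differ (between shared tokens) becomes a wildcard that skips up to
    `max_gap` tokens. Returns "" when the texts share no tokens.
    """
    a, b = text_a.lower().split(), text_b.lower().split()
    gap = r"(?:\s+\S+){0,%d}" % max_gap
    pattern = ""
    pending_gap = False
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag != "equal":
            # Only record gaps that follow at least one shared token.
            pending_gap = bool(pattern)
            continue
        for tok in a[i1:i2]:
            if pattern:
                pattern += (gap if pending_gap else "") + r"\s+"
            pattern += re.escape(tok)
            pending_gap = False
    return pattern


def regex_features(texts, patterns):
    """Return one binary feature per pattern for each text."""
    compiled = [re.compile(p) for p in patterns]
    return [[int(bool(rx.search(t.lower()))) for rx in compiled] for t in texts]


# Hypothetical Spanish training sentences differing in gender and a number.
pattern = regex_from_pair(
    "paciente fumador desde hace 10 años",
    "paciente fumadora desde hace 5 años",
)
features = regex_features(
    ["paciente exfumadora desde hace 2 años", "paciente no fuma"],
    [pattern],
)
```

In this sketch the differing tokens ("fumador" versus "fumadora", "10" versus "5") collapse into wildcards, which hints at how a generated expression might tolerate gender and number variants. A feature matrix built this way could then be passed to any classifier; the keyword filtering and confidence metric mentioned in the abstract are not modeled here.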

Bibliographic details

Source
Quality Control, Transactions | 2020, issue 2020 | pp. 29270-29280 | 11 pages
Authors
Flores Christopher A.; Figueroa Rosa L.; Pezoa Jorge E.; Zeng-Treitler Qing
Author affiliations

Univ Concepcion Dept Elect Engn Concepcion 4070409 Chile;

Univ Concepcion Dept Elect Engn Concepcion 4070409 Chile;

Univ Concepcion Dept Elect Engn Concepcion 4070409 Chile;

George Washington Univ Biomed Informat Ctr Washington DC 20037 USA;

Original format: PDF
Language: English
Keywords
Biomedical informatics; regular expressions; sequence alignment; text classification
