BMC Bioinformatics

Biomedical named entity recognition using deep neural networks with contextual information



Abstract

Background

With the increasing number of biomedical articles and resources, searching for and extracting valuable information has become challenging [1]. Researchers consider multiple information sources and transform unstructured text data into refined knowledge to facilitate research productivity [2, 3]. However, manual annotation and feature generation by biomedical experts are inefficient because they involve complex processes and require expensive, time-consuming labor [4]. Therefore, efficient and accurate natural language processing (NLP) techniques are becoming increasingly important for computational data analysis, and advanced text mining techniques are necessary to automatically analyze the biomedical literature and extract useful information from texts [5–8].

To extract valuable information, such as relationships among objects, it is important to identify significant terms in texts. Meaningful terms or phrases in a domain that can be distinguished from similar objects are called named entities, and named entity recognition (NER) is the task of automatically identifying these named entities in text and classifying them into pre-defined entity types [9, 10]. NER should be performed before tasks such as relation extraction, because annotated mentions play an important role in text mining research. In the biological domain, a fundamental task of biomedical NLP is the recognition of named entities, such as genes, diseases, chemicals, and drug names, from texts. However, biomedical NER is a particularly complex task because biological entities (i) continually increase in number with new discoveries, (ii) have large numbers of synonyms, (iii) are often referred to by abbreviations, (iv) are described by long phrases, and (v) are mixtures of letters, symbols, and punctuation [11, 12].
Several approaches have been proposed to solve these problems [1]. Most early methods for biomedical NER relied on dictionary- or rule-based approaches. NER systems using a dictionary-based method extract named entities found in pre-defined dictionaries that consist of large collections of names for each entity type. NER systems using the rule-based approach instead recognize named entities by means of rules that are manually defined based on the entities' textual patterns [7, 9, 13]. The majority of these traditional approaches have shown significant improvements in coverage and robustness, but they rely heavily on well-defined dictionaries and hand-crafted rules. Moreover, although relatively well-constructed dictionaries are available for common biological entities, such as disease and gene names, dictionaries for many other biological entities are not comprehensive or adequate [11]. In the case of rule-based methods, pre-defined patterns likewise depend on the specific textual properties of an entity class. In other words, entity-specific dictionaries and patterns require time-consuming processes and expert knowledge [7, 8].

To address these shortcomings, traditional NER methods have been replaced by supervised machine learning methods, including hidden Markov models, maximum entropy Markov models, conditional random fields (CRFs), and support vector machines [14–17]. Furthermore, machine learning methods are often combined with various others to yield hybrid approaches that are more accurate [18, 19].
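The dictionary-based extraction described above can be sketched in a few lines of Python; the entity names, types, and case-insensitive substring matching strategy below are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of dictionary-based NER. A real system would use far
# larger dictionaries and token-boundary-aware matching.
ENTITY_DICT = {
    "BRCA1": "GENE",            # hypothetical dictionary entries
    "aspirin": "CHEMICAL",
    "diabetes mellitus": "DISEASE",
}

def dictionary_ner(text):
    """Return (mention, entity_type, start_offset) for every dictionary
    entry found in the text, matched case-insensitively."""
    mentions = []
    lowered = text.lower()
    for name, etype in ENTITY_DICT.items():
        start = lowered.find(name.lower())
        while start != -1:
            mentions.append((text[start:start + len(name)], etype, start))
            start = lowered.find(name.lower(), start + 1)
    return mentions

print(dictionary_ner("Mutations in BRCA1 are linked to cancer; aspirin was given."))
```

The sketch also makes the weaknesses discussed above concrete: any synonym, abbreviation, or newly coined entity name absent from `ENTITY_DICT` is simply missed.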
Although most machine learning approaches have led to significant improvements in NER, and despite the availability of several general-purpose NER tools based on machine learning, these tools are still limited by their reliance on hand-crafted features and the human labor required for feature engineering [20–22].

Deep learning approaches that exploit large amounts of unstructured data have recently drawn research interest and have been applied to NLP problems with considerable success. For NER tasks in the biomedical domain, domain-independent methods based on deep learning and statistical word embeddings, such as the bi-directional long short-term memory network (BiLSTM) with CRF and GRAM-CNN, have been shown to outperform state-of-the-art entity-specific NER tools, such as the disease-specific tool DNorm and the chemical-specific tool ChemSpot [12, 18, 23–26]. Recently, Devlin et al. proposed a new NLP architecture named BERT [27]. BERT (Bi-directional Encoder Representations from Transformers) is a deep bi-directional self-attention model pre-trained with the Transformer architecture [28]; it uses more than 2.5 billion words for pre-training and obtains new state-of-the-art results on various NLP tasks, including NER.

For machine learning, contextual information has already been demonstrated to lead to significant improvements [29]. Context representations usually comprise a collection of neighboring word embeddings in a window around the target word, or an average of these window-based embeddings [30]. We propose herein an NER system designed to more explicitly
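The window-based context representation mentioned above, averaging the embeddings of neighboring words, can be illustrated with a small sketch; the vocabulary, embedding dimensionality, and window size here are illustrative assumptions rather than the paper's actual configuration.

```python
import numpy as np

# Toy embedding table: each word maps to a fixed random 4-dimensional
# vector. Real systems would use pre-trained embeddings instead.
rng = np.random.default_rng(0)
VOCAB = "the mutation in this gene causes disease".split()
EMB = {w: rng.normal(size=4) for w in VOCAB}

def window_context(tokens, i, k=2):
    """Average the embeddings of up to k neighbors on each side of
    token i, excluding the target word itself."""
    neighbors = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
    vectors = [EMB[w] for w in neighbors if w in EMB]
    return np.mean(vectors, axis=0)

tokens = "the mutation in this gene causes disease".split()
ctx = window_context(tokens, tokens.index("gene"))
print(ctx.shape)  # (4,)
```

The resulting vector can then be concatenated with the target word's own embedding as an additional input feature, which is one common way such context representations are fed to a downstream tagger.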
