Journal: Scientific Programming

A Low-Cost Named Entity Recognition Research Based on Active Learning


Abstract

Named entity recognition (NER) is an essential part of many natural language processing technologies, such as information extraction, information retrieval, and intelligent question answering. This paper describes the development of the AL-CRF model, a NER approach based on active learning (AL). The AL-CRF model proceeds as follows. First, the samples are clustered using the k-means approach. Then, stratified sampling is performed on the resulting clusters to obtain initial samples, which are used to train the basic conditional random field (CRF) classifier. Next, the selection process begins, using entropy as its criterion: the samples with the highest entropy values are added to the training set. The learning process is then repeated, and the CRF classifier is retrained on the updated training set. The learning and selection processes of AL run iteratively until the F-measure (the harmonic mean of precision and recall) stabilizes and the final NER model is obtained. Several NER experiments are performed on legislative and medical cases to validate the performance of AL-CRF. The testing data include Chinese judicial documents and Chinese electronic medical records (EMRs). Testing indicates that the proposed algorithm achieves better recognition accuracy and recall than the conventional CRF model. Moreover, the main advantage of the approach is that it requires fewer manually labelled training samples while being more effective. This can result in a more cost-effective and more reliable process.
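
The iterative procedure summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formulas or parameters, so every function name here is hypothetical, the per-sentence score (mean token entropy), the batch size, the seed fraction, and the stabilization test (a fixed window and tolerance) are all assumptions. Cluster assignments are taken as input from any k-means implementation, and the CRF is abstracted behind the `train`, `evaluate`, and `marginals_for` callbacks.

```python
import math
import random
from typing import Any, Callable, Dict, List, Sequence, Tuple


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sentence_entropy(marginals: Sequence[Sequence[float]]) -> float:
    """Mean per-token entropy; an assumed way to score a whole sentence."""
    if not marginals:
        return 0.0
    return sum(token_entropy(m) for m in marginals) / len(marginals)


def stratified_seed(cluster_ids: Sequence[int], fraction: float,
                    rng: random.Random) -> List[int]:
    """Draw the same fraction of indices from every cluster (at least one)."""
    by_cluster: Dict[int, List[int]] = {}
    for idx, cluster in enumerate(cluster_ids):
        by_cluster.setdefault(cluster, []).append(idx)
    seed: List[int] = []
    for members in by_cluster.values():
        k = max(1, round(fraction * len(members)))
        seed.extend(rng.sample(members, k))
    return sorted(seed)


def select_most_uncertain(pool: Sequence[int], score: Callable[[int], float],
                          batch_size: int) -> List[int]:
    """Return the pool indices with the highest uncertainty scores."""
    return sorted(pool, key=score, reverse=True)[:batch_size]


def f_has_stabilized(history: Sequence[float], window: int = 3,
                     tol: float = 0.005) -> bool:
    """Assumed stopping rule: last `window` F-scores lie within `tol`."""
    if len(history) < window:
        return False
    recent = history[-window:]
    return max(recent) - min(recent) <= tol


def active_learning_loop(
    cluster_ids: Sequence[int],
    train: Callable[[List[int]], Any],            # labelled indices -> model
    evaluate: Callable[[Any], float],             # model -> F-score
    marginals_for: Callable[[Any, int], Sequence[Sequence[float]]],
    batch_size: int = 2,
    seed_fraction: float = 0.2,
) -> Tuple[Any, List[int], List[float]]:
    """Seed by stratified sampling, then retrain and query until F stabilizes."""
    rng = random.Random(0)
    labelled = stratified_seed(cluster_ids, seed_fraction, rng)
    pool = [i for i in range(len(cluster_ids)) if i not in set(labelled)]
    history: List[float] = []
    while True:
        model = train(labelled)
        history.append(evaluate(model))
        if not pool or f_has_stabilized(history):
            return model, labelled, history
        # Query the samples the current model is least certain about.
        picked = select_most_uncertain(
            pool, lambda i: sentence_entropy(marginals_for(model, i)), batch_size)
        labelled.extend(picked)
        pool = [i for i in pool if i not in set(picked)]
```

In practice, `train` would fit a CRF on the manually labelled sentences, `evaluate` would compute the F-score on a held-out set, and `marginals_for` would return the model's per-token label probabilities for an unlabelled sentence. The loop also stops when the unlabelled pool is exhausted, a safeguard added in this sketch that the abstract does not mention.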
