Journal of Data and Information Science

Automatic Keyphrase Extraction from Scientific Chinese Medical Abstracts Based on Character-Level Sequence Labeling

Abstract

Purpose: Automatic keyphrase extraction (AKE) is an important task for grasping the main points of a text. In this paper, we aim to combine the benefits of the sequence labeling formulation and a pretrained language model to propose an automatic keyphrase extraction model for Chinese scientific research.

Design/methodology/approach: We treat AKE from Chinese text as a character-level sequence labeling task to avoid the segmentation errors of Chinese tokenizers, and initialize our model with the pretrained language model BERT, released by Google in 2018. We collect data from the Chinese Science Citation Database and construct a large-scale dataset in the medical domain, containing 100,000 abstracts as the training set, 6,000 abstracts as the development set, and 3,094 abstracts as the test set. We use unsupervised keyphrase extraction methods, including term frequency (TF), TF-IDF, and TextRank, as well as supervised machine learning methods, including Conditional Random Field (CRF), Bidirectional Long Short-Term Memory network (BiLSTM), and BiLSTM-CRF, as baselines. Experiments are designed to compare word-level and character-level sequence labeling approaches on the supervised machine learning models and the BERT-based models.

Findings: Compared with character-level BiLSTM-CRF, the best baseline model with an F1 score of 50.16%, our character-level sequence labeling model based on BERT obtains an F1 score of 59.80%, an absolute improvement of 9.64%.

Research limitations: We only consider the automatic keyphrase extraction task rather than keyphrase generation, so only keyphrases that occur in the given text can be extracted. In addition, our proposed dataset is not suitable for dealing with nested keyphrases.

Practical implications: We make our character-level IOB-format dataset for Chinese Automatic Keyphrase Extraction from scientific Chinese medical abstracts (CAKE) publicly available for the benefit of the research community at: https://github.com/possible1402/Dataset-For-Chinese-Medical-Keyphrase-Extraction .

Originality/value: By designing comparative experiments, our study demonstrates that the character-level formulation is more suitable for the Chinese automatic keyphrase extraction task given the general trend toward pretrained language models. Our proposed dataset also provides a unified basis for model evaluation and can promote the development of Chinese automatic keyphrase extraction to some extent.
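To make the character-level formulation concrete, the sketch below builds IOB tags over the characters of an abstract from its gold keyphrases, in the spirit of the CAKE format described above. The example sentence, keyphrase, and helper function are illustrative assumptions, not taken from the released dataset.

```python
# A minimal sketch of character-level IOB tagging for Chinese keyphrase
# extraction: every character of the abstract receives a B/I/O tag marking
# keyphrase spans, so no word segmenter is involved.

def char_level_iob(text: str, keyphrases: list[str]) -> list[tuple[str, str]]:
    """Tag every character of `text` with B/I/O according to keyphrase spans."""
    tags = ["O"] * len(text)
    for kp in keyphrases:
        start = 0
        while True:
            idx = text.find(kp, start)
            if idx == -1:
                break
            # Only label spans that are still untagged, since the dataset
            # described in the abstract does not handle nested keyphrases.
            if all(t == "O" for t in tags[idx:idx + len(kp)]):
                tags[idx] = "B"
                for i in range(idx + 1, idx + len(kp)):
                    tags[i] = "I"
            start = idx + len(kp)
    return list(zip(text, tags))


if __name__ == "__main__":
    abstract = "本文研究糖尿病的诊断方法。"  # illustrative sentence, not from CAKE
    keyphrases = ["糖尿病"]
    for ch, tag in char_level_iob(abstract, keyphrases):
        print(ch, tag)
```

A character-level tagger initialized from BERT can then be framed as token classification over these tags. The following is a hedged sketch using the public Hugging Face transformers API and the bert-base-chinese checkpoint; it is not the authors' training code, and the classification head here is randomly initialized rather than fine-tuned on CAKE.

```python
# Sketch: character-level tagging as BERT token classification.
# bert-base-chinese tokenizes Chinese text character by character, so the
# character tags above align directly with the model's sub-tokens.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-chinese", num_labels=3  # B, I, O
)

text = "本文研究糖尿病的诊断方法。"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, 3)
predictions = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(list(zip(tokens, predictions)))
```

In a full pipeline, the B/I/O labels produced by the first sketch would supply the supervision for fine-tuning the token-classification head, with predictions decoded back into keyphrase spans at evaluation time.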
