Journal of Data and Information Science

Automatic Keyphrase Extraction from Scientific Chinese Medical Abstracts Based on Character-Level Sequence Labeling

Abstract

Purpose: Automatic keyphrase extraction (AKE) is an important task for grasping the main points of a text. In this paper, we aim to combine the benefits of the sequence labeling formulation and a pretrained language model to propose an automatic keyphrase extraction model for Chinese scientific research.

Design/methodology/approach: We treat AKE from Chinese text as a character-level sequence labeling task to avoid the segmentation errors of Chinese tokenizers, and initialize our model with the pretrained language model BERT, released by Google in 2018. We collect data from the Chinese Science Citation Database and construct a large-scale dataset in the medical domain, containing 100,000 abstracts as the training set, 6,000 abstracts as the development set, and 3,094 abstracts as the test set. We use unsupervised keyphrase extraction methods, including term frequency (TF), TF-IDF, and TextRank, as well as supervised machine learning methods, including Conditional Random Field (CRF), Bidirectional Long Short-Term Memory network (BiLSTM), and BiLSTM-CRF, as baselines. Experiments are designed to compare word-level and character-level sequence labeling approaches on the supervised machine learning models and the BERT-based models.

Findings: Compared with character-level BiLSTM-CRF, the best baseline model with an F1 score of 50.16%, our character-level sequence labeling model based on BERT obtains an F1 score of 59.80%, an absolute improvement of 9.64%.

Research limitations: We only consider the automatic keyphrase extraction task rather than keyphrase generation, so only keyphrases that occur in the given text can be extracted. In addition, our proposed dataset is not suitable for dealing with nested keyphrases.

Practical implications: We make our character-level IOB-format dataset for Chinese Automatic Keyphrase Extraction from scientific Chinese medical abstracts (CAKE) publicly available for the benefit of the research community at: https://github.com/possible1402/Dataset-For-Chinese-Medical-Keyphrase-Extraction .

Originality/value: By designing comparative experiments, our study demonstrates that the character-level formulation is more suitable for the Chinese automatic keyphrase extraction task given the general trend toward pretrained language models. Our proposed dataset also provides a unified basis for model evaluation and can promote the development of Chinese automatic keyphrase extraction to some extent.
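To make the character-level formulation concrete, the sketch below builds IOB tags over the characters of an abstract from its gold keyphrases, in the spirit of the CAKE format described above. The example sentence, keyphrase, and helper function are illustrative assumptions, not taken from the released dataset.

```python
# A minimal sketch of character-level IOB tagging for Chinese keyphrase
# extraction: every character of the abstract receives a B/I/O tag marking
# keyphrase spans, so no word segmenter is involved.

def char_level_iob(text: str, keyphrases: list[str]) -> list[tuple[str, str]]:
    """Tag every character of `text` with B/I/O according to keyphrase spans."""
    tags = ["O"] * len(text)
    for kp in keyphrases:
        start = 0
        while True:
            idx = text.find(kp, start)
            if idx == -1:
                break
            # Only label spans that are still untagged, since the dataset
            # described in the abstract does not handle nested keyphrases.
            if all(t == "O" for t in tags[idx:idx + len(kp)]):
                tags[idx] = "B"
                for i in range(idx + 1, idx + len(kp)):
                    tags[i] = "I"
            start = idx + len(kp)
    return list(zip(text, tags))


if __name__ == "__main__":
    abstract = "本文研究糖尿病的诊断方法。"  # illustrative sentence, not from CAKE
    keyphrases = ["糖尿病"]
    for ch, tag in char_level_iob(abstract, keyphrases):
        print(ch, tag)
```

A character-level tagger initialized from BERT can then be framed as token classification over these tags. The following is a hedged sketch using the public Hugging Face transformers API and the bert-base-chinese checkpoint; it is not the authors' training code, and the classification head here is randomly initialized rather than fine-tuned on CAKE.

```python
# Sketch: character-level tagging as BERT token classification.
# bert-base-chinese tokenizes Chinese text character by character, so the
# character tags above align directly with the model's sub-tokens.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-chinese", num_labels=3  # B, I, O
)

text = "本文研究糖尿病的诊断方法。"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, 3)
predictions = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(list(zip(tokens, predictions)))
```

In a full pipeline, the B/I/O labels produced by the first sketch would supply the supervision for fine-tuning the token-classification head, with predictions decoded back into keyphrase spans at evaluation time.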
