Venue: 3rd Workshop on Semantic Web and Information Extraction

Semi-supervised Sequence Labeling for Named Entity Extraction based on Tri-Training: Case Study on Chinese Person Name Extraction


Abstract

Named entity extraction is a fundamental task for many knowledge engineering applications. Existing studies rely on annotated training data, which is expensive to obtain for large data sets, and this limits the effectiveness of recognition. In this research, we propose an automatic labeling procedure that prepares training data from structured resources containing known named entities. Because this automatically labeled training data may contain noise, a self-testing procedure can be used as a follow-up to remove low-confidence annotations and increase extraction performance with less training data. In addition to preparing labeled training data, we also employ semi-supervised learning to make use of large amounts of unlabeled data. By modifying tri-training for sequence labeling and deriving a proper initialization, we can further improve entity extraction. In the task of Chinese person name extraction with 364,685 sentences (8,672 news articles) and 54,449 (11,856 distinct) person names, an F-measure of 90.4% can be achieved.
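As a rough illustration of two ideas named in the abstract, the following minimal Python sketch shows (1) automatic labeling of a sentence from a list of known person names and (2) the agreement step of standard tri-training applied to tag sequences. The abstract does not give the authors' actual procedure, so everything here is an assumption: the function names `auto_label` and `agreed_examples` are hypothetical, character-level B/I/O tags and longest-match lookup are illustrative choices, and whole-sequence agreement between two classifiers is only one possible way to adapt tri-training to sequence labeling.

```python
from typing import Iterable, List, Sequence, Tuple


def auto_label(sentence: str, known_names: Iterable[str]) -> List[str]:
    """Tag each character B/I/O by longest-match lookup of known names.

    Assumption: names come from some structured resource; this is a
    dictionary-matching sketch, not the paper's labeling procedure.
    """
    # Try longer names first so a short name cannot shadow a longer one.
    names = sorted({n for n in known_names if n}, key=len, reverse=True)
    tags = ["O"] * len(sentence)
    i = 0
    while i < len(sentence):
        match = next((n for n in names if sentence.startswith(n, i)), None)
        if match is None:
            i += 1
            continue
        tags[i] = "B"
        for j in range(i + 1, i + len(match)):
            tags[j] = "I"
        i += len(match)
    return tags


def agreed_examples(
    unlabeled: Sequence[str],
    pred_j: Sequence[List[str]],
    pred_k: Sequence[List[str]],
) -> List[Tuple[str, List[str]]]:
    """Tri-training agreement step for one classifier (hypothetical variant).

    Keeps the unlabeled sentences on which the other two classifiers
    predict exactly the same tag sequence; these would be added to the
    third classifier's training set.
    """
    return [(s, a) for s, a, b in zip(unlabeled, pred_j, pred_k) if a == b]


if __name__ == "__main__":
    print(auto_label("王小明昨天见到李华", ["王小明", "李华"]))
```

A self-testing filter of the kind the abstract mentions could sit between these two steps, for example by dropping automatically labeled sentences whose tags a trained model assigns low confidence; the threshold and confidence measure would again be design choices not specified here.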
