Journal: Information

A Two-Stage Joint Model for Domain-Specific Entity Detection and Linking Leveraging an Unlabeled Corpus

Abstract

The intensive construction of domain-specific knowledge bases (DSKBs) has created an urgent demand for research on domain-specific entity detection and linking (DSEDL). Joint models are usually adopted for DSEDL tasks, but they suffer from data imbalance and high computational complexity. Moreover, traditional feature representation methods are insufficient for domain-specific tasks because of problems such as a lack of labeled data and link sparseness in DSKBs. In this paper, a two-stage joint (TSJ) model is proposed to solve the data imbalance problem by processing entity mentions differently according to their degree of ambiguity. In addition, three novel methods are put forward to generate effective features by incorporating an unlabeled corpus. One crucial feature for entity detection is the mention type, extracted by a long short-term memory (LSTM) model trained on automatically annotated data. The other two features mainly concern entity linking: the inner-document topical coherence, measured from entity co-occurrence relationships in the corpus, and the cross-document entity coherence, evaluated using similar documents. An overall F1 value of 74.26% is obtained on a dataset of real-world movie comments, demonstrating the effectiveness of the proposed approach and indicating its potential for use in real-world domain-specific applications.
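
The abstract does not give the formulas behind the TSJ model, so the following is only an illustrative sketch of the two ideas it names: handling mentions differently by ambiguity, and scoring candidates by entity co-occurrence in an unlabeled corpus. Everything here is an assumption made for illustration: the function names, the toy entity identifiers, the use of Jaccard overlap as the co-occurrence score, and the rule that a mention with exactly one candidate counts as unambiguous. It is not the authors' implementation, and it omits the LSTM mention-type feature and the cross-document coherence feature.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(docs):
    """Count in how many documents each entity and each entity pair appears."""
    entity_df = Counter()
    pair_df = Counter()
    for entities in docs:
        unique = set(entities)
        entity_df.update(unique)
        pair_df.update(frozenset(p) for p in combinations(sorted(unique), 2))
    return entity_df, pair_df


def jaccard(a, b, entity_df, pair_df):
    """Jaccard overlap of the document sets containing a and b (assumed measure)."""
    if a == b:
        return 1.0
    both = pair_df[frozenset((a, b))]
    either = entity_df[a] + entity_df[b] - both
    return both / either if either else 0.0


def coherence(candidate, context, entity_df, pair_df):
    """Mean co-occurrence score between a candidate and the context entities."""
    if not context:
        return 0.0
    total = sum(jaccard(candidate, c, entity_df, pair_df) for c in context)
    return total / len(context)


def link_two_stage(mentions, candidates, entity_df, pair_df):
    """Stage 1 links single-candidate mentions; stage 2 ranks the rest by coherence."""
    # Stage 1 (assumption): a mention with exactly one candidate is unambiguous.
    linked = {m: candidates[m][0] for m in mentions if len(candidates.get(m, [])) == 1}
    context = list(linked.values())
    # Stage 2: pick the candidate most coherent with the stage-1 entities.
    for m in mentions:
        options = candidates.get(m, [])
        if len(options) > 1:
            linked[m] = max(options, key=lambda e: coherence(e, context, entity_df, pair_df))
    return linked


# Hypothetical unlabeled corpus: each document is a list of entity identifiers.
docs = [
    ["Inception", "Nolan"],
    ["Inception", "Nolan", "DiCaprio"],
    ["Titanic_1997", "DiCaprio", "Cameron"],
    ["Titanic_1953", "Negulesco"],
]
entity_df, pair_df = build_cooccurrence(docs)

# Hypothetical mentions from one movie comment and their candidate entities.
mentions = ["Cameron", "Titanic"]
candidates = {"Cameron": ["Cameron"], "Titanic": ["Titanic_1997", "Titanic_1953"]}
result = link_two_stage(mentions, candidates, entity_df, pair_df)
print(result)
```

On this toy data, "Cameron" is linked in the first stage, and "Titanic" is then resolved to Titanic_1997 because only that candidate co-occurs with Cameron in the corpus. Resolving the easy mentions first gives the ambiguous ones a cleaner context to be scored against, which is one plausible reading of the staged design the abstract describes.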
