Conference: International Conference on Language Resources and Evaluation

CLEEK: A Chinese Long-text Corpus for Entity Linking

Abstract

Entity linking, one of the fundamental tasks in natural language processing, is crucial to knowledge fusion and to knowledge base construction and update. However, while research on entity linking for English text continues to advance, its Chinese counterpart is still in its infancy. One prominent issue is that publicly available annotated datasets and evaluation benchmarks are scarce and deficient. Specifically, existing Chinese corpora for entity linking were mainly constructed from noisy short texts, such as microblogs and news headlines; long texts, which cover a wider spectrum of real-life scenarios, were largely overlooked. To address this issue, we build CLEEK, a Chinese corpus of multi-domain long text for entity linking, in order to encourage the advancement of entity linking in languages other than English. The corpus consists of 100 documents from diverse domains and is publicly accessible. Moreover, we devise a measure to evaluate the difficulty of documents with respect to entity linking, which is then used to characterize the corpus. Additionally, the results of two baselines and seven state-of-the-art solutions on CLEEK are reported and compared. The empirical results validate the usefulness of CLEEK and the effectiveness of the proposed difficulty measure.
