Journal: Knowledge and Information Systems

Lightly supervised acquisition of named entities and linguistic patterns for multilingual text mining

Abstract

Named Entity Recognition and Classification (NERC) is an important component of applications such as Opinion Tracking, Information Extraction, and Question Answering. When these applications must work in several languages, NERC becomes a bottleneck because its development requires language-specific tools and resources such as lists of names or annotated corpora. This paper presents a lightly supervised system that acquires lists of names and linguistic patterns from large raw text collections in Western languages, starting with only a few seeds per class selected by a human expert. Experiments were carried out with English and Spanish news collections and with the Spanish Wikipedia. Evaluation of NE classification on standard datasets shows that the NE lists achieve high precision and that contextual patterns increase recall significantly. The system would therefore be helpful for applications where annotated NERC data are not available, such as those that must deal with several Western languages or with information from different domains.
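
The abstract does not spell out the acquisition procedure, so the following is only a generic sketch of seed-based bootstrapping, assuming a tokenized corpus and a single entity class. It is not the paper's system. The names `contexts`, `bootstrap`, `rounds`, and `min_pattern_count` are hypothetical, as is the heuristic that a candidate name is any capitalized token.

```python
from collections import Counter


def contexts(tokens, i, width=1):
    """Return the (left, right) context words around position i."""
    left = tuple(tokens[max(0, i - width):i])
    right = tuple(tokens[i + 1:i + 1 + width])
    return left, right


def bootstrap(sentences, seeds, rounds=2, min_pattern_count=2):
    """Alternately grow a name list and a context-pattern list for one class.

    Illustrative assumption: a pattern is kept when it surrounds known
    names at least `min_pattern_count` times in the corpus.
    """
    names = set(seeds)
    patterns = set()
    for _ in range(rounds):
        # Step 1: count the contexts in which known names occur.
        counts = Counter()
        for tokens in sentences:
            for i, tok in enumerate(tokens):
                if tok in names:
                    counts[contexts(tokens, i)] += 1
        patterns |= {p for p, c in counts.items() if c >= min_pattern_count}
        # Step 2: capitalized tokens found in a kept context become names.
        for tokens in sentences:
            for i, tok in enumerate(tokens):
                if tok[:1].isupper() and contexts(tokens, i) in patterns:
                    names.add(tok)
    return names, patterns
```

Called with a handful of seed names for one class and a list of tokenized sentences, `bootstrap` returns the enlarged name list together with the contexts it learned. A realistic system would also score and filter candidates to limit drift across rounds, which this sketch omits.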
