Comparing and combining a semantic tagger and a statistical tool for MWE extraction

Abstract

Automatic extraction of multiword expressions (MWEs) presents a tough challenge for the NLP community and corpus linguistics. Indeed, although numerous knowledge-based symbolic approaches and statistically driven algorithms have been proposed, efficient MWE extraction remains an unsolved issue. In this paper, we evaluate the Lancaster UCREL Semantic Analysis System (henceforth USAS; Rayson, P., Archer, D., Piao, S., McEnery, T., 2004. The UCREL semantic analysis system. In: Proceedings of the LREC-04 Workshop, Beyond Named Entity Recognition Semantic labelling for NLP tasks, Lisbon, Portugal, pp. 7–12) for MWE extraction, and explore the possibility of improving USAS by incorporating a statistical algorithm. Developed at Lancaster University, the USAS system automatically annotates English corpora with semantic category information. Employing a large-scale, semantically classified multiword expression template database, the system is also capable of detecting many multiword expressions and of assigning semantic field information to the MWEs it extracts. USAS therefore offers a unique tool for MWE extraction, allowing us both to extract and to semantically classify MWEs, but it can sometimes suffer from low recall. Consequently, we compare USAS, which employs a symbolic approach, with a statistical tool based on collocational information, in order to determine the pros and cons of these different tools and, more importantly, to examine the possibility of improving MWE extraction by combining them. As we report in this paper, we found a highly complementary relation between the two tools: USAS missed many domain-specific MWEs (law/court terms in this case), and the statistical tool missed many commonly used MWEs that occur with low frequency (fewer than three occurrences in this case). Given this complementary relation, we propose that MWE coverage can be significantly increased by combining a lexicon-based symbolic approach with a collocation-based statistical approach.
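The complementary relation described in the abstract can be illustrated with a minimal Python sketch. It is not the USAS system or the statistical tool evaluated in the paper: the lexicon, the toy text, the use of pointwise mutual information (PMI) as the association score, and the names `lexicon_mwes` and `collocation_mwes` are all assumptions made for illustration. Only the frequency threshold of three is taken from the abstract.

```python
import math
from collections import Counter


def lexicon_mwes(tokens, lexicon):
    """Symbolic side: bigrams of the text that are listed in an MWE lexicon."""
    return {bg for bg in zip(tokens, tokens[1:]) if bg in lexicon}


def collocation_mwes(tokens, min_freq=3, min_pmi=1.0):
    """Statistical side: bigrams that are frequent enough and strongly associated.

    PMI is an assumed scoring function; the abstract only says the tool
    relies on collocational information.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    found = set()
    for (w1, w2), freq in bigrams.items():
        if freq < min_freq:
            continue  # low-frequency MWEs are invisible to this approach
        pmi = math.log2(freq * n / (unigrams[w1] * unigrams[w2]))
        if pmi >= min_pmi:
            found.add((w1, w2))
    return found


# Hypothetical general-purpose lexicon: knows a common MWE, no court terms.
LEXICON = {("of", "course")}

# Toy text: one domain-specific term repeated three times, one common MWE once.
TEXT = (
    "the magistrates court sat . of course a magistrates court adjourned . "
    "later this magistrates court resumed ."
)
tokens = TEXT.split()

symbolic = lexicon_mwes(tokens, LEXICON)
statistical = collocation_mwes(tokens)
combined = symbolic | statistical

print("lexicon only:    ", sorted(symbolic))
print("collocation only:", sorted(statistical))
print("combined:        ", sorted(combined))
```

In this toy setting the lexicon finds only the common but infrequent expression, the collocation filter finds only the repeated domain term, and their union covers both, which mirrors the kind of coverage gain the abstract proposes.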

Bibliographic details

  • Source
    Computer Speech and Language | 2005, Issue 4 | pp. 378-397 | 20 pages
  • Author affiliations

    Department of Linguistics and Modern English Language, Lancaster University, Lancaster LA1 4YT, United Kingdom;

    Computing Department, Lancaster University, Lancaster LA1 4YT, United Kingdom;

    Department of Linguistics and Modern English Language, Lancaster University, Lancaster LA1 4YT, United Kingdom;

    Department of Linguistics and Modern English Language, Lancaster University, Lancaster LA1 4YT, United Kingdom;

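  • Indexed in: Science Citation Index (SCI); Engineering Index (EI)
  • Original format: PDF
  • Language of text: English
  • CLC classification: Computing technology, computer technology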
