Conference: 10th Workshop on Multiword Expressions

Breaking Bad: Extraction of Verb-Particle Constructions from a Parallel Subtitles Corpus



Abstract

The automatic extraction of verb-particle constructions (VPCs) is of particular interest to the NLP community. Previous studies have shown that word alignment methods can be used with parallel corpora to successfully extract a range of multi-word expressions (MWEs). In this paper, the technique is applied to a new type of corpus: a collection of movie and television series subtitles that is parallel in English and Spanish. Building on previous research, it is shown that a precision of 94 ± 4.7% can be achieved in English VPC extraction. This high precision is achieved despite the difficulties of aligning and tagging subtitle data. Moreover, many of the extracted VPCs are not present in online lexical resources, highlighting the benefit of this corpus type, which contains a large number of slang and other informal expressions. An added benefit of the word alignment process is that translations are also extracted automatically for each VPC. A precision of 75 ± 8.5% is found for the translations of English VPCs into Spanish. This study thus shows that VPCs are a particularly good subset of the MWE spectrum to attack using word alignment methods, and that subtitle data provide a range of interesting expressions that do not exist in other corpus types.
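The abstract reports each precision figure with a margin of error but does not say how the margin was computed or how many items were evaluated. As a rough sketch only, the Python below assumes a normal-approximation (Wald) 95% interval for a proportion and a hypothetical evaluation sample of 100 items per experiment. Both the method and the sample size are assumptions for illustration, not details taken from the abstract, and the names `margin_of_error` and `ASSUMED_SAMPLE_SIZE` are invented here.

```python
import math


def margin_of_error(precision, n, z=1.96):
    """Half-width of a normal-approximation (Wald) interval for a proportion.

    precision: observed proportion of correct items, between 0 and 1
    n:         number of items that were manually evaluated
    z:         standard normal quantile (1.96 corresponds to roughly 95%)
    """
    return z * math.sqrt(precision * (1.0 - precision) / n)


# ASSUMPTION: 100 evaluated items per experiment. The abstract does not
# state the sample size; this value is chosen only for illustration.
ASSUMED_SAMPLE_SIZE = 100

for label, p in [("English VPC extraction", 0.94),
                 ("Spanish translations", 0.75)]:
    moe = margin_of_error(p, ASSUMED_SAMPLE_SIZE)
    print(f"{label}: {p:.0%} ± {moe:.1%}")
```

Under these assumptions the margins come out at about 4.7 and 8.5 percentage points, which is consistent with the figures quoted in the abstract. The comparison also shows why the translation figure carries the wider interval: for a fixed sample size, the margin grows as the proportion moves away from 1 toward 0.5.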

Bibliographic details

  • Conference location: Gothenburg (SE)
  • Author

    Aaron Smith

  • Affiliation

    Department of Linguistics and Philology, Uppsala University, Box 635, 75126 Uppsala, Sweden

  • Original format: PDF
  • Language of text: English (eng)
