Conference on the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

A Systematic Study of Leveraging Subword Information for Learning Word Representations

Abstract

The use of subword-level information (e.g., characters, character n-grams, morphemes) has become ubiquitous in modern word representation learning. Its importance is especially attested for morphologically rich languages, which generate a large number of rare words. Despite steadily increasing interest in such subword-informed word representations, a systematic comparative analysis across typologically diverse languages and different tasks is still missing. In this work, we deliver such a study, focusing on the variation of two crucial components required for subword-level integration into word representation models: 1) segmentation of words into subword units, and 2) subword composition functions that produce final word representations. We propose a general framework for learning subword-informed word representations that allows for easy experimentation with different segmentation and composition components, including more advanced techniques based on position embeddings and self-attention. Using the unified framework, we run experiments over a large number of subword-informed word representation configurations (60 in total) on 3 tasks (general and rare word similarity, dependency parsing, fine-grained entity typing) for 5 languages representing 3 language types. Our main results clearly indicate that there is no "one-size-fits-all" configuration, as performance is both language- and task-dependent. We also show that configurations based on unsupervised segmentation (e.g., BPE, Morfessor) are sometimes comparable to or even outperform the ones based on supervised word segmentation.
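
To make the two components concrete, below is a minimal Python sketch (not the authors' released code) of one configuration from the space the abstract describes: segmentation into fastText-style character n-grams and composition by vector addition. The class and function names, embedding dimension, and random initialization are illustrative assumptions; other configurations in the paper swap in different segmenters (e.g., BPE, Morfessor) and composers (e.g., self-attention with position embeddings).

    # Hypothetical sketch of a subword-informed word representation:
    # (1) segment a word into subword units, (2) compose unit vectors
    # into a word vector. Dimensions and names are illustrative.
    import numpy as np

    def char_ngrams(word, n_min=3, n_max=5):
        """Segment a word into character n-grams with boundary markers."""
        padded = f"<{word}>"
        grams = []
        for n in range(n_min, n_max + 1):
            grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
        return grams

    class AdditiveComposer:
        """Compose a word vector as the sum of its subword embeddings."""
        def __init__(self, dim=300, seed=0):
            self.dim = dim
            self.rng = np.random.default_rng(seed)
            self.table = {}  # subword -> vector (hypothetical lookup table)

        def embed(self, subword):
            # Lazily initialize unseen subwords; a trained model would
            # learn these vectors instead.
            if subword not in self.table:
                self.table[subword] = self.rng.normal(scale=0.1, size=self.dim)
            return self.table[subword]

        def compose(self, word):
            return sum(self.embed(g) for g in char_ngrams(word))

    composer = AdditiveComposer()
    vec = composer.compose("morphology")
    print(vec.shape)  # (300,)

Addition is the simplest composition function; the paper's point is precisely that which segmenter/composer pairing works best depends on the language and the downstream task.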