Conference on the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

A Systematic Study of Leveraging Subword Information for Learning Word Representations

Abstract

The use of subword-level information (e.g., characters, character n-grams, morphemes) has become ubiquitous in modern word representation learning. Its importance is especially attested for morphologically rich languages, which generate a large number of rare words. Despite steadily increasing interest in such subword-informed word representations, a systematic comparative analysis across typologically diverse languages and different tasks is still missing. In this work, we deliver such a study, focusing on the variation of two crucial components required for subword-level integration into word representation models: 1) segmentation of words into subword units, and 2) subword composition functions that produce final word representations. We propose a general framework for learning subword-informed word representations that allows for easy experimentation with different segmentation and composition components, including more advanced techniques based on position embeddings and self-attention. Using the unified framework, we run experiments over a large number of subword-informed word representation configurations (60 in total) on 3 tasks (general and rare word similarity, dependency parsing, fine-grained entity typing) for 5 languages representing 3 language types. Our main results clearly indicate that there is no "one-size-fits-all" configuration, as performance is both language- and task-dependent. We also show that configurations based on unsupervised segmentation (e.g., BPE, Morfessor) are sometimes comparable to or even outperform the ones based on supervised word segmentation.
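
To make the two components concrete, below is a minimal Python sketch (not the authors' released code) of one configuration from the space the abstract describes: segmentation into fastText-style character n-grams and composition by vector addition. The class and function names, embedding dimension, and random initialization are illustrative assumptions; other configurations in the paper swap in different segmenters (e.g., BPE, Morfessor) and composers (e.g., self-attention with position embeddings).

    # Hypothetical sketch of a subword-informed word representation:
    # (1) segment a word into subword units, (2) compose unit vectors
    # into a word vector. Dimensions and names are illustrative.
    import numpy as np

    def char_ngrams(word, n_min=3, n_max=5):
        """Segment a word into character n-grams with boundary markers."""
        padded = f"<{word}>"
        grams = []
        for n in range(n_min, n_max + 1):
            grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
        return grams

    class AdditiveComposer:
        """Compose a word vector as the sum of its subword embeddings."""
        def __init__(self, dim=300, seed=0):
            self.dim = dim
            self.rng = np.random.default_rng(seed)
            self.table = {}  # subword -> vector (hypothetical lookup table)

        def embed(self, subword):
            # Lazily initialize unseen subwords; a trained model would
            # learn these vectors instead.
            if subword not in self.table:
                self.table[subword] = self.rng.normal(scale=0.1, size=self.dim)
            return self.table[subword]

        def compose(self, word):
            return sum(self.embed(g) for g in char_ngrams(word))

    composer = AdditiveComposer()
    vec = composer.compose("morphology")
    print(vec.shape)  # (300,)

Addition is the simplest composition function; the paper's point is precisely that which segmenter/composer pairing works best depends on the language and the downstream task.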