Conference: Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings

Abstract

A core part of linguistic typology is the classification of languages according to linguistic properties, such as those detailed in the World Atlas of Language Structures (WALS). Doing this manually is prohibitively time-consuming, which is in part evidenced by the fact that only 100 out of over 7,000 languages spoken in the world are fully covered in WALS. We learn distributed language representations, which can be used to predict typological properties on a massively multilingual scale. Additionally, quantitative and qualitative analyses of these language embeddings can tell us how language similarities are encoded in NLP models for tasks at different typological levels. The representations are learned in an unsupervised manner alongside tasks at three typological levels: phonology (grapheme-to-phoneme prediction and phoneme reconstruction), morphology (morphological inflection), and syntax (part-of-speech tagging). We consider more than 800 languages and find significant differences in the language representations encoded, depending on the target task. For instance, although Norwegian Bokmål and Danish are typologically close to one another, they are phonologically distant, which is reflected in their language embeddings growing relatively distant in a phonological task. We are also able to predict typological features in WALS with high accuracy, even for unseen language families.
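
The abstract describes using learned language embeddings to predict typological features. The sketch below illustrates one simple way this could work in principle: a k-nearest-neighbour vote over cosine similarity between language vectors. It is an assumption for illustration, not necessarily the classifier used in the paper. The language names, the 3-dimensional vectors, the word-order labels, and the helper names `cosine_similarity` and `predict_feature` are all hypothetical; real embeddings would come from a trained model and real labels from WALS.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_feature(target, embeddings, labels, k=3):
    """Predict a categorical feature for `target` by majority vote
    over its k most similar labelled languages."""
    neighbours = sorted(
        (lang for lang in labels if lang != target),
        key=lambda lang: cosine_similarity(embeddings[target], embeddings[lang]),
        reverse=True,
    )[:k]
    votes = Counter(labels[lang] for lang in neighbours)
    return votes.most_common(1)[0][0]


# Toy vectors invented for illustration only (not real language embeddings).
toy_embeddings = {
    "lang_a": [0.9, 0.1, 0.0],
    "lang_b": [0.8, 0.2, 0.1],
    "lang_c": [0.7, 0.3, 0.0],
    "lang_d": [0.0, 0.9, 0.4],
    "lang_e": [0.1, 0.8, 0.5],
    "lang_x": [0.85, 0.15, 0.05],  # held-out language with no label
}

# Hypothetical word-order labels standing in for a WALS feature.
toy_labels = {
    "lang_a": "SVO",
    "lang_b": "SVO",
    "lang_c": "SVO",
    "lang_d": "SOV",
    "lang_e": "SOV",
}

prediction = predict_feature("lang_x", toy_embeddings, toy_labels, k=3)
print(prediction)
```

In this toy setup the held-out vector lies closest to the three languages labelled "SVO", so the vote returns that label. The same comparison run on embeddings from different tasks (phonological versus syntactic, say) is how one could observe a language pair drifting apart in one space while staying close in another.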
