
Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling


Abstract

Fixed-vocabulary language models fail to account for one of the most characteristic statistical facts of natural language: the frequent creation and reuse of new word types. Although character-level language models offer a partial solution in that they can create word types not attested in the training corpus, they do not capture the "bursty" distribution of such words. In this paper, we augment a hierarchical LSTM language model that generates sequences of word tokens character by character with a caching mechanism that learns to reuse previously generated words. To validate our model we construct a new open-vocabulary language modeling corpus (the Multilingual Wikipedia Corpus; MWC) from comparable Wikipedia articles in 7 typologically diverse languages and demonstrate the effectiveness of our model across this range of languages.
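
The abstract describes the architecture only at a high level. Below is a minimal PyTorch sketch of the mechanism it names: a character-level spelling model for new words plus a learned gate over a cache of recently generated words. All class, method, and parameter names here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch, assuming PyTorch. Names (CharWordLM, copy_gate, cache_size,
# etc.) are illustrative; this is not the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharWordLM(nn.Module):
    """Word-level LSTM that spells each word with a character LSTM and can
    instead copy a recently generated word from a bounded cache."""

    def __init__(self, n_chars, char_dim=64, word_dim=256, cache_size=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        # Encoder: reads a word's characters into a fixed-size word vector.
        self.char_enc = nn.LSTM(char_dim, word_dim, batch_first=True)
        # Sentence-level state over word vectors (the "hierarchical" part).
        self.word_rnn = nn.LSTMCell(word_dim, word_dim)
        # Decoder head: next-character logits when spelling a new word.
        self.char_out = nn.Linear(word_dim, n_chars)
        # Gate: probability of reusing a cached word instead of spelling one.
        self.copy_gate = nn.Linear(word_dim, 1)
        self.cache_size = cache_size
        self.cache = []  # (word_vector, word_string) pairs, most recent last

    def encode_word(self, char_ids):
        # char_ids: (1, word_len) tensor of character indices.
        _, (h, _) = self.char_enc(self.char_emb(char_ids))
        return h[-1].squeeze(0)  # (word_dim,) word vector

    def next_word_mixture(self, h_word):
        # h_word: (word_dim,) current sentence-level hidden state.
        p_copy = torch.sigmoid(self.copy_gate(h_word))  # scalar in (0, 1)
        if not self.cache:
            return torch.zeros(()), None
        keys = torch.stack([v for v, _ in self.cache])  # (n_cached, word_dim)
        attn = F.softmax(keys @ h_word, dim=0)          # which word to copy
        # The character decoder covers the remaining 1 - p_copy of the mass.
        return p_copy.squeeze(), attn

    def remember(self, vec, word):
        # Burstiness: a word generated once stays cheap to regenerate.
        self.cache.append((vec.detach(), word))
        self.cache = self.cache[-self.cache_size:]
```

The point of the gate in this sketch is that reuse becomes a first-class decision: the model pays the character-by-character spelling cost only for genuinely new words, while bursty repetitions of a rare word reduce to a single cache lookup.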
