SIGFSM Workshop on Statistical NLP and Weighted Automata

Distributed representation and estimation of WFST-based n-gram models



Abstract

We present methods for partitioning a weighted finite-state transducer (WFST) representation of an n-gram language model into multiple blocks or shards, each of which is a stand-alone WFST n-gram model in its own right, allowing processing with existing algorithms. After independent estimation, including normalization, smoothing and pruning on each shard, the shards can be reassembled into a single WFST that is identical to the model that would have resulted from estimation without sharding. We then present an approach that uses data partitions in conjunction with WFST sharding to estimate models on orders-of-magnitude more data than would have otherwise been feasible with a single process. We present some numbers on shard characteristics when large models are trained from a very large data set. Functionality to support distributed n-gram modeling has been added to the open-source OpenGrm library.
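The core idea described above — partition an n-gram model into disjoint shards, process each shard independently, then reassemble a result identical to unsharded processing — can be illustrated with a minimal counting sketch. This is not the paper's WFST implementation or the OpenGrm API; the `shard_of` assignment rule and all function names here are hypothetical, chosen only to show why disjoint shards reassemble losslessly.

```python
from collections import Counter

def ngrams(tokens, order):
    """Yield all n-grams of length 1..order from a token sequence."""
    for n in range(1, order + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def shard_of(gram, num_shards):
    # Hypothetical assignment rule: shard by the n-gram's first word,
    # so every n-gram type lands in exactly one shard (disjointness is
    # what makes reassembly exact).
    return hash(gram[0]) % num_shards

def sharded_counts(corpus, order, num_shards):
    """Count n-grams independently per shard, then reassemble."""
    shards = [Counter() for _ in range(num_shards)]
    for sentence in corpus:
        for g in ngrams(sentence, order):
            shards[shard_of(g, num_shards)][g] += 1
    # Reassembly: because shards hold disjoint n-gram types, their
    # union equals the counts a single unsharded pass would produce.
    merged = Counter()
    for s in shards:
        merged.update(s)
    return shards, merged
```

In the paper's setting each shard is additionally a well-formed stand-alone WFST n-gram model, so normalization, smoothing, and pruning (not just counting) can run per shard before reassembly; the sketch above only demonstrates the disjoint-partition-and-merge invariant.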
