SIGFSM Workshop on Statistical NLP and Weighted Automata

Distributed representation and estimation of WFST-based n-gram models



Abstract

We present methods for partitioning a weighted finite-state transducer (WFST) representation of an n-gram language model into multiple blocks or shards, each of which is a stand-alone WFST n-gram model in its own right, allowing processing with existing algorithms. After independent estimation, including normalization, smoothing and pruning on each shard, the shards can be reassembled into a single WFST that is identical to the model that would have resulted from estimation without sharding. We then present an approach that uses data partitions in conjunction with WFST sharding to estimate models on orders-of-magnitude more data than would have otherwise been feasible with a single process. We present some numbers on shard characteristics when large models are trained from a very large data set. Functionality to support distributed n-gram modeling has been added to the open-source OpenGrm library.
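The core idea described above — partition an n-gram model into disjoint shards, process each shard independently, then reassemble a result identical to unsharded processing — can be illustrated with a minimal counting sketch. This is not the paper's WFST implementation or the OpenGrm API; the `shard_of` assignment rule and all function names here are hypothetical, chosen only to show why disjoint shards reassemble losslessly.

```python
from collections import Counter

def ngrams(tokens, order):
    """Yield all n-grams of length 1..order from a token sequence."""
    for n in range(1, order + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def shard_of(gram, num_shards):
    # Hypothetical assignment rule: shard by the n-gram's first word,
    # so every n-gram type lands in exactly one shard (disjointness is
    # what makes reassembly exact).
    return hash(gram[0]) % num_shards

def sharded_counts(corpus, order, num_shards):
    """Count n-grams independently per shard, then reassemble."""
    shards = [Counter() for _ in range(num_shards)]
    for sentence in corpus:
        for g in ngrams(sentence, order):
            shards[shard_of(g, num_shards)][g] += 1
    # Reassembly: because shards hold disjoint n-gram types, their
    # union equals the counts a single unsharded pass would produce.
    merged = Counter()
    for s in shards:
        merged.update(s)
    return shards, merged
```

In the paper's setting each shard is additionally a well-formed stand-alone WFST n-gram model, so normalization, smoothing, and pruning (not just counting) can run per shard before reassembly; the sketch above only demonstrates the disjoint-partition-and-merge invariant.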
