Conference: IEEE International Conference on Acoustics, Speech and Signal Processing

Improving n-gram probability estimates by compound-head clustering


Abstract

Compounding is one of the most productive word formation processes in many languages and is therefore a main source of data sparsity in language modeling. Many solutions have been suggested to model compound words, most of which break the compound into its constituents and train a new model with them. In earlier work, we argued that this approach is suboptimal and we presented a novel technique that clusters new, domain-specific compound words together with their semantic heads. The clusters were then used to build a class-based n-gram model that enabled a reliable estimation of n-gram probabilities, without the need for additional training data. In this paper, we investigate how this “semantic head mapping” can best be made an integral part of the language modeling strategy and find that, with some adaptations, our technique produces more accurate compound probability estimates than a baseline word-based n-gram language model, which leads to a significant word error rate reduction for Dutch read speech.
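
As a rough illustration of the class-based idea described in the abstract, the sketch below maps an unseen compound to the class of its semantic head and factors a bigram probability as P(class | previous class) times P(word | class). It is a minimal sketch under stated assumptions, not the authors' implementation: the names `class_of`, `member_prob`, `train_class_bigrams`, and `bigram_prob` are hypothetical, the estimates are unsmoothed maximum-likelihood counts, and the within-class probabilities are made-up illustrative values rather than anything reported in the paper.

```python
from collections import Counter

def train_class_bigrams(sentences, class_of):
    """Count class-level bigrams; words without a class entry form their own class."""
    context_counts, bigram_counts = Counter(), Counter()
    for tokens in sentences:
        classes = [class_of.get(t, t) for t in tokens]
        for prev, cur in zip(classes, classes[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts

def bigram_prob(prev, word, context_counts, bigram_counts, class_of, member_prob):
    """Unsmoothed estimate: P(word | prev) = P(class(word) | class(prev)) * P(word | class(word))."""
    prev_class = class_of.get(prev, prev)
    word_class = class_of.get(word, word)
    if context_counts[prev_class] == 0:
        return 0.0  # unseen context; a real model would back off or smooth here
    p_class = bigram_counts[(prev_class, word_class)] / context_counts[prev_class]
    # Words outside any head class are treated as singleton classes (probability 1).
    return p_class * member_prob.get(word, 1.0)

# Toy Dutch training data in which the compound "appelboom" never occurs.
sentences = [
    ["de", "boom", "groeit"],
    ["de", "boom", "valt"],
    ["een", "boom", "groeit"],
]
# Assumption: the compound is clustered with its semantic head "boom".
class_of = {"boom": "boom", "appelboom": "boom"}
# Assumption: illustrative within-class probabilities, not estimated from data.
member_prob = {"boom": 0.9, "appelboom": 0.1}

context_counts, bigram_counts = train_class_bigrams(sentences, class_of)
p_compound = bigram_prob("de", "appelboom", context_counts, bigram_counts, class_of, member_prob)
p_head = bigram_prob("de", "boom", context_counts, bigram_counts, class_of, member_prob)
```

In this toy setup the unseen compound inherits the contexts of its head, so it receives a nonzero probability after "de" even though it never appears in the training sentences. How the within-class share should be estimated is left open here.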
