Improving n-gram probability estimates by compound-head clustering


Abstract

Compounding is one of the most productive word formation processes in many languages and is therefore a main source of data sparsity in language modeling. Many solutions have been suggested to model compound words, most of which break the compound into its constituents and train a new model with them. In earlier work, we argued that this approach is suboptimal and we presented a novel technique that clusters new, domain-specific compound words together with their semantic heads. The clusters were then used to build a class-based n-gram model that enabled a reliable estimation of n-gram probabilities, without the need for additional training data. In this paper, we investigate how this “semantic head mapping” can best be made an integral part of the language modeling strategy and find that, with some adaptations, our technique is capable of producing more accurate compound probability estimates than a baseline word-based n-gram language model, which lead to a significant word error rate reduction for Dutch read speech.
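The idea of estimating the probability of an unseen compound through its semantic head can be illustrated with a minimal sketch. It is not the authors' implementation. The helper names (`head_of`, `bigram_prob`, `compound_prob`), the suffix-matching heuristic for finding the head, the unsmoothed maximum-likelihood estimates, and the fixed within-class share `member_share` are all assumptions made for illustration.

```python
from collections import Counter


def head_of(compound, vocabulary):
    """Return the longest proper suffix of `compound` that is a known word, else None.

    Assumption: compounds are head-final (as in Dutch), so the head is
    approximated here by plain suffix matching against the vocabulary.
    """
    for start in range(1, len(compound)):
        suffix = compound[start:]
        if suffix in vocabulary:
            return suffix
    return None


def bigram_prob(prev, word, history_counts, bigrams):
    """Maximum-likelihood estimate of P(word | prev), without smoothing."""
    if history_counts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / history_counts[prev]


def compound_prob(prev, compound, history_counts, bigrams, vocabulary,
                  member_share=0.1):
    """Class-based estimate: P(compound | prev) ~ P(head | prev) * P(compound | class).

    `member_share` is a hypothetical placeholder for the probability mass
    the compound receives inside the class it shares with its head.
    """
    head = head_of(compound, vocabulary)
    if head is None:
        return 0.0
    return bigram_prob(prev, head, history_counts, bigrams) * member_share


# Toy training text in which the compound "voetbalwedstrijd" never occurs.
tokens = "de wedstrijd begint de wedstrijd eindigt".split()
vocabulary = set(tokens)
history_counts = Counter(tokens[:-1])        # counts of words used as a history
bigrams = Counter(zip(tokens, tokens[1:]))   # counts of adjacent word pairs

p = compound_prob("de", "voetbalwedstrijd", history_counts, bigrams, vocabulary)
print(p)
```

In this toy setup the unseen compound borrows the n-gram statistics of its head "wedstrijd", so it receives a non-zero estimate without any additional training data.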

Bibliographic details

Source
IEEE International Conference on Acoustics, Speech and Signal Processing | 2015 | pp. 5221-5225 | 5 pages
Authors
Pelemans Joris; Demuynck Kris; Van hamme Hugo; Wambacq Patrick;
Original format: PDF
Keywords
LVCSR; data sparsity; language models; n-grams; word clusters;
