Robust Language Modeling for a Small Corpus of Target Tasks Using Class-Combined Word Statistics and Selective Use of a General Corpus

Yosuke Wada; Norihiko Kobayashi; Tetsunori Kobayashi


Abstract

In order to improve the accuracy of language models in speech recognition tasks for which collecting a large text corpus for language model training is difficult, we propose a class-combined bigram and selective use of general text. In the class-combined bigram, the word bigram and the class bigram are combined using weights that are expressed as the functions of the preceding word frequency and the succeeding word-type count. An experiment has shown that the accuracy of the proposed class-combined bigram is equivalent to that of the word bigram trained with a text corpus that is approximately three times larger. In the selective use of general text, the language model was corrected by automatically selecting sentences that were expected to produce better accuracy from a large volume of text collected without specifying the task and by adding these sentences to a small corpus of target tasks. An experiment has shown that the recognition error rate was reduced by up to 12% compared to a case in which text was not selected. Lastly, when we created a model that uses both the class-combined bigram and text addition, further improvements were obtained, resulting in improvements of approximately 34% in adjusted perplexity and approximately 31% in the recognition error rate compared to the word bigram created from the target task text only.
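The abstract describes the class-combined bigram only in words, so the following is a minimal sketch of the idea under stated assumptions, not the authors' formulation. It interpolates a word bigram with a class bigram, P(c2|c1) * P(w2|c2), using a weight that depends on how often the preceding word was seen as a history and on how many distinct word types followed it. The specific weight n / (n + t) is a Witten-Bell-style choice assumed here for illustration; the abstract does not give the actual weighting function. The class name `ClassCombinedBigram`, the toy corpus, and the hand-written word-to-class map are all hypothetical.

```python
from collections import Counter, defaultdict


class ClassCombinedBigram:
    """Word bigram interpolated with a class bigram (illustrative sketch)."""

    def __init__(self, sentences, word2class):
        self.word2class = word2class
        self.unigram = Counter()        # c(w)
        self.class_unigram = Counter()  # c(class)
        self.history = Counter()        # times w1 occurs as a bigram history
        self.bigram = Counter()         # c(w1, w2)
        self.class_history = Counter()  # times class c1 occurs as a history
        self.class_bigram = Counter()   # c(c1, c2)
        self.followers = defaultdict(set)  # distinct word types seen after w1
        for sent in sentences:
            for w in sent:
                self.unigram[w] += 1
                self.class_unigram[word2class[w]] += 1
            for w1, w2 in zip(sent, sent[1:]):
                c1, c2 = word2class[w1], word2class[w2]
                self.history[w1] += 1
                self.bigram[(w1, w2)] += 1
                self.class_history[c1] += 1
                self.class_bigram[(c1, c2)] += 1
                self.followers[w1].add(w2)

    def word_prob(self, w1, w2):
        """Maximum-likelihood word bigram P(w2 | w1)."""
        n = self.history[w1]
        return self.bigram[(w1, w2)] / n if n else 0.0

    def class_prob(self, w1, w2):
        """Class bigram P(c2 | c1) * P(w2 | c2)."""
        c1, c2 = self.word2class[w1], self.word2class[w2]
        n_hist, n_class = self.class_history[c1], self.class_unigram[c2]
        if n_hist == 0 or n_class == 0:
            return 0.0
        return self.class_bigram[(c1, c2)] / n_hist * self.unigram[w2] / n_class

    def weight(self, w1):
        """ASSUMED Witten-Bell-style weight on the word bigram: n / (n + t)."""
        n = self.history[w1]
        t = len(self.followers.get(w1, ()))
        return n / (n + t) if n else 0.0

    def prob(self, w1, w2):
        """Combined estimate: lam * word bigram + (1 - lam) * class bigram."""
        lam = self.weight(w1)
        return lam * self.word_prob(w1, w2) + (1.0 - lam) * self.class_prob(w1, w2)


# Hypothetical toy corpus and class map, for illustration only.
sentences = [
    ["i", "want", "tea"],
    ["i", "want", "coffee"],
    ["you", "like", "tea"],
]
word2class = {
    "i": "PRON", "you": "PRON",
    "want": "VERB", "like": "VERB",
    "tea": "NOUN", "coffee": "NOUN",
}
model = ClassCombinedBigram(sentences, word2class)

# "like coffee" never occurs in the corpus, so the word bigram gives 0,
# but the class bigram (VERB followed by NOUN) still assigns it some mass.
print(model.word_prob("like", "coffee"), model.prob("like", "coffee"))
```

In this sketch a history word that was seen rarely, or that was followed by many different word types, leans more on the class statistics, which is the kind of behavior that helps when the target-task corpus is small.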

Bibliographic information

Source
Systems and Computers in Japan | 2003, Issue 12 | 11 pages
Authors
Yosuke Wada; Norihiko Kobayashi; Tetsunori Kobayashi
Original format: PDF
Language: English
CLC classification: Computing technology, computer technology
Keywords
Large-vocabulary continuous speech recognition; Language model; Class N-gram; Task adaptation