EURASIP Journal on Audio, Speech, and Music Processing

Classification of heterogeneous text data for robust domain-specific language modeling



Abstract

The robustness of n-gram language models depends on the quality of the text data on which they are trained. Text corpora collected from varied sources such as web pages or electronic documents cover many possible topics. To build efficient and robust domain-specific language models, domain-oriented segments must be separated from the large body of text data; the remaining out-of-domain data can be used only to update existing in-domain n-gram probability estimates. In this paper, we describe the classification of heterogeneous text data into two classes, in-domain and out-of-domain, intended mainly for language modeling in task-oriented speech recognition for the judicial domain. The proposed text classification algorithm first detects the theme of short text segments from their most frequent key phrases. Each text segment is then represented in the vector space model as a feature vector with term weighting. To classify the segments as in-domain or out-of-domain, document similarity with automatic thresholding is used. Experimental results on modeling the Slovak language and adapting to the judicial domain show a significant improvement in model perplexity and improved performance of the Slovak transcription and dictation system.
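The vector-space and thresholding steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the term-weighting formula, the similarity measure, or the thresholding rule, so the sketch assumes smoothed TF-IDF weights, cosine similarity against a hypothetical seed text of in-domain vocabulary, and the mean similarity score as a stand-in for automatic thresholding. The key-phrase theme detection step is omitted, and the names `tfidf_vectors`, `cosine`, and `classify_segments` are illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of Unicode letters only.
    return re.findall(r"[^\W\d_]+", text.lower())


def tfidf_vectors(texts):
    # Term weighting: raw term frequency times a smoothed inverse
    # document frequency (an assumed weighting, not the paper's).
    counts = [Counter(tokenize(t)) for t in texts]
    n_docs = len(counts)
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
           for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_segments(segments, seed_in_domain):
    # Weight all segments together with the in-domain seed text,
    # then score each segment by its similarity to the seed.
    vectors = tfidf_vectors(list(segments) + [seed_in_domain])
    seed_vec = vectors[-1]
    scores = [cosine(v, seed_vec) for v in vectors[:-1]]
    # Assumed thresholding rule: the mean score splits the two classes.
    threshold = sum(scores) / len(scores)
    labels = [s >= threshold for s in scores]  # True means in-domain
    return labels, scores, threshold


if __name__ == "__main__":
    segments = [
        "The court ruled that the defendant must pay damages",
        "The judge dismissed the appeal in court",
        "The football team won the match yesterday",
        "New recipe for chocolate cake with cream",
    ]
    seed = "court judge defendant appeal ruling verdict law"
    labels, scores, threshold = classify_segments(segments, seed)
    for segment, label, score in zip(segments, labels, scores):
        tag = "in-domain" if label else "out-of-domain"
        print(f"{score:.3f}  {tag:13s}  {segment}")
```

In this toy run the two legal segments share vocabulary with the seed and score above the mean, while the other two share none and score zero. A real corpus would need a more careful threshold choice than the mean, which is used here only to keep the sketch short.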
