Journal: Advances in Electrical and Electronic Engineering

Categorization of Unorganized Text Corpora for better Domain-Specific Language Modeling

Abstract

This paper describes the categorization of unorganized text data gathered from the Internet into in-domain and out-of-domain data for better domain-specific language modeling and speech recognition. It presents an algorithm for text categorization and topic detection based on the most frequent key phrases. In this scheme, each document entering the categorization process is represented in a vector space model, with term weights computed from the term frequency and the inverse document frequency. Documents are then automatically classified as in-domain or out-of-domain by comparing them with the list of key phrases, using one of the selected distance or similarity measures and a predefined threshold. Experimental results for language modeling and adaptation to the judicial domain show a significant improvement in model perplexity of about 19% and a relative reduction of about 5.54% in the word error rate of the Slovak transcription and dictation system.
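The categorization scheme in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenizer, the smoothed IDF formula, the choice of cosine similarity as the similarity measure, the threshold value of 0.1, and all function names and sample texts are assumptions made for illustration only.

```python
import math
from collections import Counter

PUNCT = ".,;:!?()\"'"

def tokenize(text):
    """Lowercase whitespace tokenizer that strips surrounding punctuation (assumed)."""
    tokens = (w.strip(PUNCT).lower() for w in text.split())
    return [t for t in tokens if t]

def tfidf_vectors(token_lists):
    """Return one sparse TF-IDF vector (dict: term -> weight) per token list."""
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    # Smoothed IDF; the exact weighting used in the paper is not given in the abstract.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in tf.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def categorize(documents, key_phrases, threshold=0.1):
    """Label each document by its similarity to the key-phrase list."""
    token_lists = [tokenize(d) for d in documents]
    key_tokens = tokenize(" ".join(key_phrases))
    vectors = tfidf_vectors(token_lists + [key_tokens])
    key_vec = vectors[-1]
    return ["in-domain" if cosine(v, key_vec) >= threshold else "out-of-domain"
            for v in vectors[:-1]]

# Hypothetical sample data for a judicial target domain.
documents = [
    "The court ruled that the defendant must pay damages.",
    "The football team won the match yesterday.",
]
key_phrases = ["court", "defendant", "judge ruled"]
labels = categorize(documents, key_phrases)
print(labels)
```

In practice the threshold and the similarity measure would be tuned on held-out data, since the abstract only states that a predefined threshold and one of several selected measures are used.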