Categorization of Unorganized Text Corpora for better Domain-Specific Language Modeling


Abstract

This paper describes the categorization of unorganized text data gathered from the Internet into in-domain and out-of-domain data for better domain-specific language modeling and speech recognition. An algorithm for text categorization and topic detection based on the most frequent key phrases is presented. In this scheme, each document entering the categorization process is represented in a vector space model, with terms weighted by term frequency and inverse document frequency (TF-IDF). Text documents are then automatically classified as in-domain or out-of-domain by comparing them with the list of key phrases, using one of the selected distance/similarity measures and a predefined threshold. Experimental results for language modeling and adaptation to the judicial domain show a significant improvement in model perplexity of about 19 % and a relative decrease of about 5.54 % in the word error rate of the Slovak transcription and dictation system.
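
The scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the choice of cosine similarity, the threshold value, and all function names (`tfidf_vectors`, `cosine`, `categorize`) are assumptions made for illustration only. Key phrases are treated here as a bag of single words, which simplifies the phrase matching the abstract refers to.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1
        # Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
        vectors.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def categorize(docs, key_phrases, threshold=0.1):
    """Label each document by its similarity to the key-phrase list."""
    # The key-phrase list is treated as one reference pseudo-document.
    reference = " ".join(key_phrases)
    vectors = tfidf_vectors(list(docs) + [reference])
    ref_vec = vectors[-1]
    return [
        "in-domain" if cosine(vec, ref_vec) >= threshold else "out-of-domain"
        for vec in vectors[:-1]
    ]
```

Calling `categorize(documents, key_phrases, threshold=0.1)` returns one label per document. In practice the threshold would have to be tuned on held-out data, and other distance/similarity measures could be substituted for `cosine` without changing the rest of the sketch.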


Bibliographic details

Source: Advances in Electrical and Electronic Engineering, 2013, No. 5, 6 pages
Original format: PDF
CLC classification: Industrial technology

Similar literature

1. Toward live domain-specific languages: From text differencing to adapting models at run time [J]. van Rozen Riemer, van der Storm Tijs. Software and Systems Modeling. 2019, No. 1
2. Classification of heterogeneous text data for robust domain-specific language modeling [J]. Ján Staš, Jozef Juhár, Daniel Hládek. EURASIP Journal on Audio, Speech, and Music Processing. 2014, No. 1
3. Overcoming Language Barriers: Assessing the Potential of Machine Translation and Topic Modeling for the Comparative Analysis of Multilingual Text Corpora [J]. Reber Ueli. Communication Methods and Measures. 2019, No. 2
4. Cross language Text Categorization by acquiring Multilingual Domain Models from Comparable Corpora [C]. Alfio Gliozzo, Carlo Strapparava. 43rd Annual Meeting of the Association for Computational Linguistics: Proceedings of the Conference. 2005
5. Fast and Effective Approximations for Summarization and Categorization of Very Large Text Corpora [D]. Godbehere, Andrew B. 2015
6. Empirical automated vocabulary discovery using large text corpora and advanced natural language processing tools [O]. W. R. Hersh, E. H. Campbell, D. A. Evans. 1996
7. Categorization of unorganized text corpora for better domain-specific language modeling [O]. Staš Ján, Zlacký Daniel, Hládek Daniel. 2013
