Asian Language Processing, 2009 (IALP '09)

Semi-supervised Learning of Domain-Specific Language Models from General Domain Data



Abstract

We present a semi-supervised learning method for building domain-specific language models (LMs) from general-domain data. This method uses a small amount of domain-specific data as seeds to tap domain-specific resources residing in a larger amount of general-domain data, with the help of topic modeling technologies. The proposed algorithm first performs topic decomposition (TD) on the combined dataset of domain-specific and general-domain data using probabilistic latent semantic analysis (PLSA). It then derives domain-specific word n-gram counts using the mixture modeling scheme of PLSA. Finally, it uses the traditional n-gram modeling approach to construct domain-specific LMs from the domain-specific word n-gram counts. Experimental results show that this approach outperforms both state-of-the-art methods and a simulated supervised learning method on our data sets. In particular, the semi-supervised learning method achieves better performance even with a very small amount of domain-specific data.


