Published in: 中国海洋大学学报(自然科学版) (Periodical of Ocean University of China, Natural Science Edition)

Automatic Detection of Multilingual Text Based on Latent Dirichlet Allocation

     

Abstract

This paper proposes an unsupervised language identification method, based on a Latent Dirichlet Allocation (LDA) latent semantic model, for automatically detecting the languages in mixed multilingual text. Taking the perspective of speech recognition, it adapts LDA to language identification by using n-grams as features. Rather than selecting the number of topics by perplexity, as is usual, the paper introduces a selection method based on minimum description length (MDL) and trains the LDA model with the collapsed Gibbs sampling (CGS) learning algorithm. The mitlm toolkit is used to generate n-gram count files and to build a character-level language model for multilingual identification. The LDA model is then compared against three other language identification systems. The experiments use nine European languages from the ECI/MCI benchmark, and the paper analyzes the results in detail; the method achieves good accuracy and recall without any annotation.
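The pipeline the abstract describes (character n-gram features fed to an LDA model trained by collapsed Gibbs sampling) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and not taken from the paper: the helpers `char_ngrams`, `lda_cgs`, and `dominant_topic`, the hyperparameters `alpha` and `beta`, the fixed iteration count, and the use of plain Python in place of mitlm count files. The MDL-based choice of topic number is not shown; `num_topics` is simply passed in.

```python
import random


def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def lda_cgs(docs, num_topics, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA over lists of n-gram tokens.

    Returns the document-topic count matrix (one row per document).
    alpha and beta are illustrative symmetric Dirichlet priors.
    """
    rng = random.Random(seed)
    vocab = {g: i for i, g in enumerate(sorted({g for d in docs for g in d}))}
    vocab_size = len(vocab)
    doc_ids = [[vocab[g] for g in d] for d in docs]

    ndk = [[0] * num_topics for _ in doc_ids]          # doc-topic counts
    nkw = [[0] * vocab_size for _ in range(num_topics)]  # topic-term counts
    nk = [0] * num_topics                               # tokens per topic
    z = []                                              # topic of each token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(doc_ids):
        assignments = []
        for w in doc:
            k = rng.randrange(num_topics)
            assignments.append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
        z.append(assignments)

    topics = range(num_topics)
    for _ in range(iters):
        for d, doc in enumerate(doc_ids):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                # Unnormalized conditional p(z = t | all other assignments).
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][w] + beta)
                    / (nk[t] + vocab_size * beta)
                    for t in topics
                ]
                k = rng.choices(topics, weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1
    return ndk


def dominant_topic(topic_counts):
    """Index of the topic holding the most tokens in one document."""
    return max(range(len(topic_counts)), key=topic_counts.__getitem__)


if __name__ == "__main__":
    texts = [
        "the weather is nice today and the sky is clear",
        "der Himmel ist heute klar und das Wetter ist gut",
        "this is another short sentence in the same language",
        "das ist noch ein kurzer Satz in derselben Sprache",
    ]
    docs = [char_ngrams(t) for t in texts]
    counts = lda_cgs(docs, num_topics=2)
    for text, row in zip(texts, counts):
        print(dominant_topic(row), text)
```

In this reading, each latent topic plays the role of a language, so a document's dominant topic serves as its predicted language label. With so little text the grouping produced by the sampler is not guaranteed to match the true languages; the sketch only shows the mechanics of the sampling step.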
