Journal: Current Science: A Fortnightly Journal of Research

Language model adaptation in Tamil language using cross-lingual latent semantic analysis with document aligned corpora


Abstract

Unlike English, Tamil lacks a text corpus large enough to build a reliable language model. In this work, a domain-independent language model is built from 500 Tamil documents. To improve it, the model is adapted using Tamil translation lexicons generated from English by cross-lingual latent semantic analysis (CLSA). Because Tamil is an agglutinative language, using surface word forms in CLSA does not yield good translation accuracy. The proposed partial morphological analysis of Tamil reduces the lexical gap between English and Tamil words and thereby improves translation accuracy. Experiments with direct and topic-specific model adaptation were conducted to improve the domain-independent model. Significant improvements were obtained in both perplexity and word error rate.
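
The abstract reports gains in perplexity after adapting a background model. As a rough illustration of that idea only, the sketch below linearly interpolates two add-alpha smoothed unigram models and compares perplexity on a small test sequence. Everything in it is an assumption made for illustration: the function names, the placeholder tokens, the unigram order, and the interpolation weight `lam` are hypothetical, and linear interpolation is a generic adaptation technique that may differ from what the paper actually does.

```python
import math
from collections import Counter


def unigram_model(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def interpolate(background, adapted, lam):
    """Linear interpolation: lam * adapted + (1 - lam) * background."""
    return {w: lam * adapted[w] + (1.0 - lam) * background[w] for w in background}


def perplexity(model, tokens):
    """Exponential of the average negative log-probability per token."""
    log_prob = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_prob / len(tokens))


# Hypothetical placeholder tokens, not data from the paper.
background_text = "a b c a b d a c b a".split()  # small general corpus
adaptation_text = "e f e f e a".split()          # stands in for an in-domain estimate
test_text = "e f a e".split()                    # in-domain test sequence

vocab = sorted(set(background_text) | set(adaptation_text) | set(test_text))
background = unigram_model(background_text, vocab)
adapted = unigram_model(adaptation_text, vocab)
mixed = interpolate(background, adapted, lam=0.5)  # lam is an assumed weight

ppl_background = perplexity(background, test_text)
ppl_mixed = perplexity(mixed, test_text)
print(f"background: {ppl_background:.2f}, interpolated: {ppl_mixed:.2f}")
```

On this toy data the interpolated model assigns more probability to the in-domain test tokens, so its perplexity is lower than that of the background model alone. In practice the weight would be tuned on held-out text rather than fixed.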
