Workshop on Spoken Language Technology

Improving word segmentation for Thai speech translation

Abstract

A vocabulary list and a language model are primary components of a speech translation system. Generating both from plain text is straightforward for English. However, it is quite challenging for languages such as Chinese, Japanese, and Thai, whose text has no word boundary delimiter and therefore provides no word segmentation. For Thai word segmentation, Maximal Matching, a lexicon-based approach, is one of the popular methods. Nevertheless, this method relies heavily on the coverage of the lexicon. When the text contains an unknown word, the method usually produces a wrong boundary. When words are then extracted from the segmented text, some are not retrieved because of the wrong segmentation. In this paper, we propose statistical techniques to tackle this problem. We build several speech translation systems based on different word segmentation methods and show that the proposed method significantly improves translation accuracy, by about 6.42% BLEU points, compared to the baseline system.
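To make the lexicon-coverage problem concrete, here is a minimal sketch of lexicon-based segmentation. It assumes one common formulation of Maximal Matching, namely choosing the segmentation with the fewest tokens, and it is not the authors' implementation. The function name `maximal_matching`, the `unknown_cost` penalty, and the toy Latin-alphabet lexicon are all hypothetical choices made for illustration; a real system would use a Thai lexicon.

```python
def maximal_matching(text, lexicon, unknown_cost=10):
    """Segment `text` into the lowest-cost token sequence.

    Assumed formulation: each lexicon word costs 1, and a single
    out-of-lexicon character costs `unknown_cost` (hypothetical penalty),
    so the search prefers few, known words.
    """
    n = len(text)
    max_len = max((len(w) for w in lexicon), default=1)
    best = [float("inf")] * (n + 1)  # best[j]: lowest cost to segment text[:j]
    back = [0] * (n + 1)             # back[j]: start index of the last token
    best[0] = 0
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            piece = text[i:j]
            if piece in lexicon:
                cost = 1
            elif j == i + 1:
                cost = unknown_cost  # fall back to a one-character token
            else:
                continue
            if best[i] + cost < best[j]:
                best[j] = best[i] + cost
                back[j] = i
    # Follow the back-pointers from the end to recover the tokens.
    tokens = []
    j = n
    while j > 0:
        i = back[j]
        tokens.append(text[i:j])
        j = i
    return tokens[::-1]


# Toy lexicon (hypothetical); "zoo" is deliberately missing.
LEXICON = {"go", "goo", "good", "od", "day"}

known = maximal_matching("goodday", LEXICON)    # every word is in the lexicon
unknown = maximal_matching("goodzoo", LEXICON)  # "zoo" is an unknown word
print(known)
print(unknown)
```

In this toy run the unknown word "zoo" is broken into single characters, so it would never appear as a word in a vocabulary list extracted from the segmented text. That is the kind of failure the abstract attributes to limited lexicon coverage.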