Venue: International Conference on Speech Database and Assessments

Construction of Chinese conversational corpora for spontaneous speech recognition and comparative study on the trilingual parallel corpora



Abstract

In this paper, we describe the development of the segmented and POS-tagged Chinese conversational corpora currently used in the NICT/ATR speech-to-speech translation system. Over 500K manually checked utterances provide 3.5M words of Chinese corpora. As far as we know, they are the largest conversational textual corpora in the travel domain. A set of three parallel corpora is obtained by pairing the Chinese text with the corresponding Japanese and English text from which it was translated. Based on these parallel corpora, we investigate the statistics of each language and the performance of language models and speech recognition, and identify the differences among the languages. Problems in the present Chinese corpora and their solutions are also analyzed and discussed.
