BSTC: A Large-Scale Chinese-English Speech Translation Dataset

Abstract

This paper presents BSTC (Baidu Speech Translation Corpus), a large-scale Chinese-English speech translation dataset. The dataset is built from a collection of licensed videos of talks and lectures and includes about 68 hours of Mandarin speech, manual transcripts and English translations, and automatic transcripts produced by an automatic speech recognition (ASR) model. In addition, three experienced interpreters simultaneously interpreted the test talks in a mock conference setting. The corpus is intended to promote research on automatic simultaneous translation and the development of practical systems. We have organized simultaneous translation tasks and used this corpus to evaluate automatic simultaneous translation systems.