Building Parallel Corpora from Movies

Abstract

This paper proposes using dynamic time warping (DTW) to construct parallel corpora from difficult data. Parallel corpora are the raw material for machine translation (MT), and MT systems are frequently trained on European or Canadian parliamentary corpora. To build a more realistic machine translation system, we decided to use movie subtitles. These data can be considered difficult because they contain unfamiliar expressions, abbreviations, hesitations, and words that do not appear in standard dictionaries (such as vulgar words). The resulting parallel corpora can constitute a rich resource for training spontaneous speech translation systems. From 40 movies, we align 43013 English subtitles with 42306 French subtitles. This yields 37625 aligned pairs with a precision of 92.3%.
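The abstract does not say how subtitle pairs are scored or which DTW moves are allowed, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a simplified DTW-style dynamic program with three moves (match both subtitles, skip an English subtitle, skip a French subtitle) and a hypothetical scoring function, `lexicon_score`, that counts word translations found in a toy bilingual lexicon. The names `dtw_align`, `lexicon_score`, and `TOY_LEXICON` are all illustrative.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical toy lexicon; a real system would use a full bilingual dictionary.
TOY_LEXICON: Dict[str, str] = {
    "hello": "bonjour", "friend": "ami", "thanks": "merci", "yes": "oui",
}


def lexicon_score(en: str, fr: str, lexicon: Dict[str, str] = TOY_LEXICON) -> float:
    """Assumed scoring: number of English words whose translation occurs in fr."""
    fr_words = set(re.findall(r"\w+", fr.lower()))
    en_words = re.findall(r"\w+", en.lower())
    return float(sum(1 for w in en_words if lexicon.get(w) in fr_words))


def dtw_align(
    src: List[str], tgt: List[str], score: Callable[[str, str], float]
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) on the best monotonic path through the grid."""
    n, m = len(src), len(tgt)
    best = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    back = [[(0, 0)] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            moves = []
            if i > 0 and j > 0:  # diagonal: pair src[i-1] with tgt[j-1]
                moves.append((best[i - 1][j - 1] + score(src[i - 1], tgt[j - 1]),
                              (i - 1, j - 1)))
            if i > 0:            # vertical: leave src[i-1] unaligned
                moves.append((best[i - 1][j], (i - 1, j)))
            if j > 0:            # horizontal: leave tgt[j-1] unaligned
                moves.append((best[i][j - 1], (i, j - 1)))
            best[i][j], back[i][j] = max(moves, key=lambda mv: mv[0])

    # Walk back from the end, keeping only diagonal steps with a positive score.
    pairs: List[Tuple[int, int]] = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if (pi, pj) == (i - 1, j - 1) and score(src[i - 1], tgt[j - 1]) > 0:
            pairs.append((i - 1, j - 1))
        i, j = pi, pj
    pairs.reverse()
    return pairs


english = ["Hello my friend", "Uh...", "Thanks", "Yes"]
french = ["Bonjour mon ami", "Merci", "Oui"]
print(dtw_align(english, french, lexicon_score))  # [(0, 0), (2, 1), (3, 2)]
```

In this toy run the hesitation "Uh..." has no French counterpart and is left unaligned, which mirrors how the two subtitle streams in the abstract differ in length (43013 versus 42306) and produce fewer aligned pairs than either side.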
