Home > Foreign-language conferences > First Conference on Machine Translation > English-French Document Alignment Based on Keywords and Statistical Translation
English-French Document Alignment Based on Keywords and Statistical Translation

Abstract

In this paper we present our approach to the Bilingual Document Alignment Task (WMT16), where the main goal was to achieve the best recall in extracting aligned pages from the provided data. Our approach consists of three main parts: data preprocessing, keyword extraction, and scoring of text pairs based on keyword matching. For text preprocessing we use the TreeTagger pipeline, which contains the Unitok tool (Michelfeit et al., 2014) for tokenization and the TreeTagger morphological analyzer (Schmid, 1994). After extracting keywords from the texts according to TF-IDF scoring, our system searches for comparable English-French pairs. Using a statistical dictionary created from a large English-French parallel corpus, the system is able to find comparable documents. Finally, this procedure is combined with the baseline algorithm and the best one-to-one pairing is selected. The result reaches 91.6% recall on the provided training data. After a deep error analysis (see Section 5), the recall reached 97.4%.
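The keyword-matching idea in the abstract can be illustrated with a minimal sketch. This is not the authors' system: it skips the TreeTagger preprocessing and the combination with the baseline, the dictionary `en_fr` is a tiny hand-written stand-in for the statistical dictionary, the score (share of English keywords with a translation among the French keywords) is an assumed scoring rule, and a greedy selection stands in for the one-to-one pairing step. All function names and the sample data are hypothetical.

```python
import math
from collections import Counter


def top_keywords(docs, k=2):
    """Return the k highest TF-IDF tokens for each document (id -> token list)."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # document frequency of each token
    keywords = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        scores = {t: (c / len(tokens)) * math.log(n_docs / df[t])
                  for t, c in tf.items()}
        # Sort by descending score, then alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords[doc_id] = set(ranked[:k])
    return keywords


def pair_score(en_keys, fr_keys, en_fr):
    """Share of English keywords with a dictionary translation in fr_keys."""
    if not en_keys:
        return 0.0
    hits = sum(1 for w in en_keys if en_fr.get(w, set()) & fr_keys)
    return hits / len(en_keys)


def align(en_docs, fr_docs, en_fr, k=2):
    """Greedy one-to-one pairing of documents by keyword-match score."""
    en_kw = top_keywords(en_docs, k)
    fr_kw = top_keywords(fr_docs, k)
    scored = sorted(
        ((pair_score(en_kw[e], fr_kw[f], en_fr), e, f)
         for e in en_kw for f in fr_kw),
        key=lambda x: (-x[0], x[1], x[2]),
    )
    used_en, used_fr, pairs = set(), set(), []
    for score, e, f in scored:
        if score > 0 and e not in used_en and f not in used_fr:
            pairs.append((e, f))
            used_en.add(e)
            used_fr.add(f)
    return pairs


# Toy data (assumed, already tokenized): two pages per language.
en_docs = {"en/cats": ["cat", "cat", "milk", "the"],
           "en/cars": ["car", "car", "road", "the"]}
fr_docs = {"fr/voitures": ["voiture", "voiture", "route", "le"],
           "fr/chats": ["chat", "chat", "lait", "le"]}
en_fr = {"cat": {"chat"}, "milk": {"lait"}, "car": {"voiture"},
         "road": {"route"}, "the": {"le", "la"}}

pairs = align(en_docs, fr_docs, en_fr)
print(pairs)
```

On this toy input each English page is paired with the French page that shares its translated keywords. A real system would need a much larger keyword set per page and a dictionary with translation probabilities.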

Bibliographic details

  • Source
  • Conference location: Berlin (DE)
  • Author affiliations

    Lexical Computing CZ s.r.o., Centre of Natural Language Processing, Faculty of Informatics, Masaryk University, Botanicka 68a 602 00 Brno;

    Lexical Computing CZ s.r.o., Centre of Natural Language Processing, Faculty of Informatics, Masaryk University, Botanicka 68a 602 00 Brno;

    Lexical Computing CZ s.r.o., Centre of Natural Language Processing, Faculty of Informatics, Masaryk University, Botanicka 68a 602 00 Brno;

  • Conference organizer
  • Original format: PDF
  • Text language: eng
  • Chinese Library Classification
  • Keywords
