
The role of syntax and semantics in machine translation and quality estimation of machine-translated user-generated content

Abstract

The availability of the Internet has led to a steady increase in the volume of online user-generated content, the majority of which is in English. Machine-translating this content to other languages can help disseminate the information contained in it to a broader audience. However, reliably publishing these translations requires a prior estimate of their quality. This thesis is concerned with the statistical machine translation of Symantec's Norton forum content, focusing in particular on its quality estimation (QE) using syntactic and semantic information. We compare the output of phrase-based and syntax-based English-to-French and English-to-German machine translation (MT) systems automatically and manually, and find that the syntax-based methods do not necessarily handle grammar-related phenomena in translation better than the phrase-based methods. Although these systems generate sufficiently different outputs, the apparent lack of a systematic difference between these outputs impedes its utilisation in a combination framework. To investigate the role of syntax and semantics in quality estimation of machine translation, we create SymForum, a data set containing French machine translations of English sentences from Norton forum content, their post-edits and their adequacy and fluency scores. We use syntax in quality estimation via tree kernels, hand-crafted features and their combination, and find it useful both alone and in combination with surface-driven features. Our analyses show that neither the accuracy of the syntactic parses used by these systems nor the parsing quality of the MT output affects QE performance. We also find that adding more structure to French Treebank parse trees can be useful for syntax-based QE. We use semantic role labelling (SRL) for our semantic-based QE experiments.
We experiment with the limited resources that are available for French and find that a small manually annotated training set is substantially more useful than a much larger artificially created set. We use SRL in quality estimation using tree kernels, hand-crafted features and their combination. Additionally, we introduce PAM, a QE metric based on the predicate-argument structure match between source and target. We find that the SRL quality, especially on the target side, is the major factor negatively affecting the performance of the semantic-based QE. Finally, we annotate English and French Norton forum sentences with their phrase structure syntax using an annotation strategy adapted for user-generated text. We find that user errors occur in only a small fraction of the data, but their correction does improve parsing performance. These treebanks (Foreebank) prove to be useful as supplementary training data in adapting the parsers to the forum text. The improved parses ultimately increase the performance of the semantic-based QE. However, a reliable semantic-based QE system requires further improvements in the quality of the underlying semantic role labelling.

Bibliographic record

  • Author

    Zadeh Kaljahi Rasoul Samad;

  • Author affiliation
  • Year 2015
  • Total pages
  • Original format PDF
  • Language of text en
  • Chinese Library Classification
