
Domain adaptation of translation models for multilingual applications.


Abstract

The performance of a statistical translation algorithm in the context of multilingual applications such as cross-lingual information retrieval (CLIR) and machine translation (MT) depends on the quality, quantity and proper domain matching of the training data. Traditionally, manual selection and customization of training resources has been the prevailing approach. In addition to being labor-intensive, this approach does not scale to the large quantity of heterogeneous resources that have recently become available, such as parallel text and bilingual thesauri in various domains. More importantly, manual customization does not offer a solution to efficiently and effectively producing tailored translation models for a mixture of heterogeneous target documents in various domains, topics, languages and genres. Translation models trained on a general domain do not work well in technical domains; models trained on written documents are not appropriate for spoken dialogue; models trained on manual transcripts can be sub-optimal for translating noisy transcripts produced by a speech recognizer; finally, models trained on a mixture of topics are not optimal for any of the topic-specific documents.

We seek to address this challenge by automatically adapting translation models (and implicitly parallel training resources) to specific target domains or sub-domains.

The high-level adaptation process involves automatically weighting and combining multiple translation resources, according to several criteria, in order to better match a target corpus or a specific domain sample. The criteria we examine include lexical-level domain match, translation quality estimates, size, and taxonomy representation. An orthogonal dimension in the adaptation process is the granularity level at which these criteria are measured and applied: from the collection level (under the assumption of homogeneous within-collection data) to the document level. The relative contribution of each criterion is subsequently determined by a model that can range from uniform weighting to a global non-linear optimization model trained on application-specific evaluation data.

In this thesis, we examine how such adaptation applies to two important multilingual applications: cross-lingual information retrieval and machine translation. In CLIR, we adapt translation models for domain-specific query translation; in MT, we adapt translation models to heterogeneous target corpora and compare them with previously studied target language model adaptation. We use our adaptation algorithms to enhance state-of-the-art systems, seeking to improve performance under different testing conditions and to reduce the demand for large amounts of domain-specific parallel data. We also address the challenge of combining multiple criteria to rank parallel sentence candidates. We investigate Continuous Reactive Tabu Search (CRTS) [2], a global optimization method, as well as Reactive Affine Shaker (RASH) [6], an efficient algorithm which continuously adjusts its search area in order to identify a local minimum.

Our experiments in CLIR and statistical MT indicate that selecting training data based on the above-mentioned approaches allows a significant reduction in training data while preserving about 90% of the performance. This result significantly surpasses the random selection approach, and it holds for both CLIR and MT. As expected, the difference increases as the subdomain becomes more specific. Our optimized criteria weights considerably outperform the uniform distribution baseline, as well as lexical similarity adaptation.
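As a rough illustration of the idea of ranking parallel sentence candidates by a weighted combination of criteria, the following Python sketch may help. It is not the thesis's implementation: the `Candidate` type, the function names, the criterion names, and the simple vocabulary-overlap measure are all hypothetical choices made for this example, and the weight optimization by CRTS or RASH mentioned in the abstract is not shown (only the uniform-weight baseline is).

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A parallel sentence pair with per-criterion scores (hypothetical type)."""
    source: str
    target: str
    scores: dict  # criterion name -> score, assumed to be scaled to [0, 1]


def lexical_match(sentence, domain_vocab):
    """Fraction of whitespace tokens found in a target-domain vocabulary.

    A deliberately crude stand-in for a lexical-level domain match score.
    """
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in domain_vocab)
    return hits / len(tokens)


def uniform_weights(criteria):
    """Uniform weighting baseline: every criterion contributes equally."""
    return {name: 1.0 / len(criteria) for name in criteria}


def combined_score(candidate, weights):
    """Linear combination of the criterion scores under the given weights."""
    return sum(weight * candidate.scores[name] for name, weight in weights.items())


def select_top(candidates, weights, fraction):
    """Keep the highest-scoring fraction of candidates (at least one)."""
    ranked = sorted(candidates, key=lambda c: combined_score(c, weights), reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


if __name__ == "__main__":
    vocab = {"gene", "protein", "cell"}  # assumed target-domain sample vocabulary
    pairs = [("the gene encodes a protein", "..."), ("the weather is nice", "...")]
    candidates = [
        # The "quality" score is a made-up constant here, not a real estimate.
        Candidate(src, tgt, {"lexical": lexical_match(src, vocab), "quality": 0.5})
        for src, tgt in pairs
    ]
    weights = uniform_weights(["lexical", "quality"])
    for chosen in select_top(candidates, weights, fraction=0.5):
        print(chosen.source)
```

In this sketch, replacing `uniform_weights` with weights tuned on application-specific evaluation data would correspond, loosely, to the optimized weighting the abstract contrasts with the uniform baseline.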

Bibliographic record

  • Author

    Rogati, Monica

  • Author affiliation

    Carnegie Mellon University

  • Degree-granting institution Carnegie Mellon University
  • Subject Computer Science
  • Degree Ph.D.
  • Year 2009
  • Pages 117 p.
  • Total pages 117
  • Original format PDF
  • Language English
