
A Systematic Comparison of Data Selection Criteria for SMT Domain Adaptation



Abstract

Data selection has yielded significant improvements in the effective use of training data: sentences are extracted from large general-domain corpora to adapt statistical machine translation (SMT) systems to in-domain data. This paper performs an in-depth analysis of three sentence selection techniques. The first is cosine tf-idf, which comes from information retrieval (IR). The second is a perplexity-based approach, which comes from language modeling. Both have already been applied to SMT data selection in the literature. The third, edit distance, is proposed for this task for the first time in this paper. After investigating the individual models, a combination of all three techniques is proposed at both the corpus level and the model level. Comparative experiments are conducted on a Hong Kong law Chinese-English corpus, and the results indicate the following: (i) the degree of constraint of the similarity measure is not monotonically related to domain-specific translation quality; (ii) the individual selection models fail to perform effectively and robustly; but (iii) bilingual resources and combination methods help to balance out-of-vocabulary (OOV) words and irrelevant data; and (iv) our method consistently boosts overall translation performance, which can ensure optimal quality of a real-life SMT system.
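The three selection criteria named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the smoothed idf weighting, the token-level Levenshtein normalization, and the add-one unigram language model (real systems typically use higher-order n-gram models) are all assumptions chosen to keep the example short and dependency-free.

```python
import math
from collections import Counter


def cosine_tfidf(query, candidate, corpus):
    """Cosine similarity between tf-idf vectors of two token lists.

    Document frequencies are taken from `corpus`, a list of token lists.
    """
    n = len(corpus)
    df = Counter(tok for doc in corpus for tok in set(doc))

    def vec(tokens):
        tf = Counter(tokens)
        # Smoothed idf (an assumption) so unseen tokens get a finite weight.
        return {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                for t, c in tf.items()}

    u, v = vec(query), vec(candidate)
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def edit_distance(a, b):
    """Token-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Edit distance rescaled to [0, 1]; 1.0 means identical sequences."""
    return 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)


def unigram_perplexity(sentence, in_domain):
    """Perplexity of `sentence` under an add-one smoothed unigram model
    estimated from the in-domain token lists (a deliberate simplification)."""
    counts = Counter(tok for doc in in_domain for tok in doc)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unknown tokens
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab))
                   for t in sentence)
    return math.exp(-log_prob / max(len(sentence), 1))


# Hypothetical toy data: two in-domain (legal) sentences, two candidates.
in_domain = [["the", "court", "shall", "hear", "the", "appeal"],
             ["the", "tenant", "shall", "pay", "the", "rent"]]
legal = ["the", "court", "shall", "hear", "the", "case"]
chat = ["i", "love", "pizza", "and", "movies"]
```

Under each criterion, a general-domain sentence would be kept when it scores as close to the in-domain data: high cosine tf-idf or edit similarity, or low perplexity. On the toy data, `legal` ranks ahead of `chat` under all three scores; the selection thresholds and the corpus-level and model-level combinations described in the abstract are not reproduced here.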
