Exploiting comparable corpora.

Abstract

One of the major bottlenecks in the development of Statistical Machine Translation systems for most language pairs is the lack of bilingual parallel training data. Currently available parallel corpora span relatively few language pairs and very few domains; building new ones of sufficiently large size and high quality is time-consuming and expensive.

In this thesis, I propose methods that enable automatic creation of parallel corpora by exploiting a rich, diverse, and readily available resource: comparable corpora. Comparable corpora are bilingual texts that, while not parallel in the strict sense, are somewhat related and convey overlapping information. Such texts exist in large quantities on the Web; a good example is the multilingual news feeds produced by news agencies such as Agence France Presse, CNN, and BBC.

I present novel methods for extracting parallel data of good quality from such comparable collections. I show how to detect parallelism at various granularity levels, and thus find parallel documents (if there are any in the collection), parallel sentences, and parallel sub-sentential fragments. To demonstrate the validity of this approach, I use my method to extract data from large-scale comparable corpora for various language pairs, and show that the extracted data helps improve the end-to-end performance of a state-of-the-art machine translation system.

Bibliographic details

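  • Author

    Munteanu, Dragos Stefan

  • Author affiliation

    University of Southern California

  • Degree-granting institution: University of Southern California
  • Subject: Computer Science
  • Degree: Ph.D.
  • Year: 2006
  • Pages: 128 p.
  • Total pages: 128
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: Automation technology, computer technology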