
Automatic parallel corpora and bilingual terminology extraction from parallel WebSites


Abstract

Parallel corpora are now so important that they need no special introduction. Unfortunately, publicly available parallel corpora are somewhat limited in range. There are large corpora on politics, legislation, medicine and other specific areas, but corpora for many other areas are missing. There is currently a large investment in using the Web as a corpus. This article presents GWB, a tool for the automatic construction of parallel corpora from the web. We argue that it is possible to build high-quality terminological corpora automatically, just by specifying a sensible Internet domain and an appropriate set of seed keywords. GWB is a web spider that works in conjunction with a set of other open-source tools, defining a pipeline that includes document retrieval from the web, sentence-level alignment and analysis of its quality, extraction of bilingual dictionaries and terminology, and construction of off-line dictionaries.
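
The abstract mentions a quality analysis of the sentence-level alignment but does not say how it is done. As a minimal sketch only, and not GWB's actual method, one common heuristic is to compare the character lengths of each aligned sentence pair and flag pairs whose lengths differ too much. The function names and the 0.5 threshold below are hypothetical choices made for illustration.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def length_ratio(src: str, tgt: str) -> float:
    """Return shorter/longer character length of a pair, in [0.0, 1.0]."""
    a, b = len(src.strip()), len(tgt.strip())
    if a == 0 or b == 0:
        # An empty side means the pair cannot be a real translation.
        return 0.0
    return min(a, b) / max(a, b)


def filter_pairs(pairs: List[Pair], min_ratio: float = 0.5) -> List[Pair]:
    """Keep only pairs whose length ratio reaches min_ratio (assumed threshold)."""
    return [(s, t) for s, t in pairs if length_ratio(s, t) >= min_ratio]


def alignment_quality(pairs: List[Pair], min_ratio: float = 0.5) -> float:
    """Fraction of aligned pairs that pass the length-ratio check."""
    if not pairs:
        return 0.0
    return len(filter_pairs(pairs, min_ratio)) / len(pairs)


if __name__ == "__main__":
    sample = [
        ("The house is big.", "A casa é grande."),
        ("Hello.", "Este é um texto muito mais longo do que o original."),
    ]
    print(filter_pairs(sample))
    print(alignment_quality(sample))
```

A length ratio is a crude signal: it catches gross misalignments, such as a short sentence paired with a long paragraph, but says nothing about whether the content matches, so a real pipeline would likely combine it with dictionary-based evidence.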
