Journal: Language Resources and Evaluation

Crawl and crowd to bring machine translation to under-resourced languages


Abstract

We present a widely applicable methodology to bring machine translation (MT) to under-resourced languages in a cost-effective and rapid manner. Our proposal relies on web crawling to automatically acquire parallel data to train statistical MT systems if any such data can be found for the language pair and domain of interest. If that is not the case, we resort to (1) crowdsourcing to translate small amounts of text (hundreds of sentences), which are then used to tune statistical MT models, and (2) web crawling of vast amounts of monolingual data (millions of sentences), which are then used to build language models for MT. We apply these to two respective use-cases for Croatian, an under-resourced language that has gained relevance since it recently attained official status in the European Union. The first use-case regards tourism, given the importance of this sector to Croatia's economy, while the second has to do with tweets, due to the growing importance of social media. For tourism, we crawl parallel data from 20 web domains using two state-of-the-art crawlers and explore how to combine the crawled data with bigger amounts of general-domain data. Our domain-adapted system is evaluated on a set of three additional tourism web domains and it outperforms the baseline in terms of automatic metrics and/or vocabulary coverage. In the social media use-case, we deal with tweets from the 2014 edition of the soccer World Cup. We build domain-adapted systems by (1) translating small amounts of tweets to be used for tuning by means of crowdsourcing and (2) crawling vast amounts of monolingual tweets. These systems outperform the baseline (Microsoft Bing) by 7.94 BLEU points (5.11 TER) for Croatian-to-English and by 2.17 points (1.94 TER) for English-to-Croatian on a test set translated by means of crowdsourcing. A complementary manual analysis sheds further light on these results.
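To illustrate how crawled monolingual text can feed a language model, the following is a minimal sketch of a bigram model with add-one smoothing. It is not the authors' setup: the abstract does not name a language-modelling toolkit, and the function names (`train_bigram_lm`, `log_prob`) and the toy corpus are hypothetical. A real system would train on millions of sentences with a dedicated toolkit and stronger smoothing.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count histories and bigrams over whitespace-tokenized sentences."""
    histories, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for sentence in sentences:
        words = sentence.lower().split()
        tokens = ["<s>"] + words + ["</s>"]
        histories.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
        vocab.update(words)
    return histories, bigrams, vocab


def log_prob(sentence, model):
    """Add-one smoothed log-probability of a sentence under the bigram model."""
    histories, bigrams, vocab = model
    size = len(vocab) + 1  # one extra slot reserves mass for unseen words
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (histories[prev] + size))
        for prev, word in zip(tokens, tokens[1:])
    )


# Hypothetical stand-in for crawled monolingual tweets.
toy_corpus = [
    "croatia scores a goal",
    "croatia wins the match",
    "what a goal",
]
model = train_bigram_lm(toy_corpus)
```

In a statistical MT decoder, a score of this kind is one of several features used to prefer fluent target-language output, which is why in-domain monolingual data can help even when no parallel data is available.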
