Conference: 9th International Conference on Language Resources and Evaluation

caWaC - A web corpus of Catalan and its application to language modeling and machine translation

Abstract

In this paper we present the construction process of a web corpus of Catalan built from the content of the .cat top-level domain. For collecting and processing data we use the Brno pipeline with the SpiderLing crawler and its accompanying tools. To the best of our knowledge, the corpus is the largest existing corpus of Catalan, containing 687 million words, which is a significant increase given that until now the biggest corpus of Catalan, CuCWeb, counted 166 million words. We evaluate the resulting resource on the tasks of language modeling and statistical machine translation (SMT) by calculating language model (LM) perplexity and incorporating the LM in the SMT pipeline. We compare language models trained on different subsets of the resource with those trained on the Catalan Wikipedia and the target side of the parallel data used to train the SMT system.
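
As a rough illustration of the perplexity measure mentioned in the abstract, the sketch below scores held-out text with a toy add-one-smoothed unigram model. The abstract does not state which LM type, order, or toolkit the authors used, so the model choice and the function names `train_unigram` and `perplexity` are assumptions made only for this example.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Count token frequencies; returns (counts, total number of tokens)."""
    counts = Counter(tokens)
    return counts, len(tokens)


def perplexity(model, tokens):
    """Perplexity = exp(-(1/N) * sum(log p(w_i))) over the held-out tokens.

    Assumption: add-one smoothing, with all unseen words sharing a single
    extra vocabulary slot. This is a toy model, not the paper's setup.
    """
    if not tokens:
        raise ValueError("cannot compute perplexity of empty text")
    counts, total = model
    vocab_size = len(counts) + 1  # +1 slot reserved for unseen words
    log_prob = 0.0
    for tok in tokens:
        p = (counts.get(tok, 0) + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(tokens))


if __name__ == "__main__":
    # Hypothetical token lists standing in for training and held-out text.
    model = train_unigram(["el", "gat", "el", "gos"])
    print(perplexity(model, ["el", "gat"]))
```

Lower perplexity means the model assigns higher probability to the held-out text, which is the sense in which models trained on different corpora can be compared.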
