brWaC: A WaCky Corpus for Brazilian Portuguese

Abstract

Initiatives to construct very large corpora have increased in recent years, especially those using the Web as a corpus, since large corpora are crucial for many Natural Language Processing tasks. The WaCky (Web-As-Corpus Kool Yinitiative) methodology has been used to build very large corpora (over a billion words each) for languages such as English, Italian, and German, among others. In this paper we present the ongoing work on building brWaC, a massive Brazilian Portuguese corpus crawled from .br domains. At the moment, crawling and PoS tagging are finished, resulting in a tokenized and lemmatized corpus of 3 billion words. The next step is to parse the whole corpus.