brWaC: A WaCky Corpus for Brazilian Portuguese

Abstract

Initiatives to construct very large corpora have increased in recent years, especially those that use the Web as a corpus, since large corpora are crucial for many Natural Language Processing tasks. The WaCky (Web-As-Corpus Kool Yinitiative) methodology has been used to build very large corpora (over a billion words each) for languages such as English, Italian, and German, among others. In this paper we present the ongoing work on building brWaC, a massive Brazilian Portuguese corpus crawled from .br domains. At the moment, the crawling process and the PoS tagging are finished, resulting in a tokenized and lemmatized corpus of 3 billion words. The next step is to parse the whole corpus.