International Conference on Computational Linguistics

Evaluating different methods for automatically collecting large general corpora for Basque from the web

Abstract

In the last few years, much work has been done to build Basque corpora. However, we still lack a large general corpus comparable in size to those that exist for major languages, and the gap is wider still if we take into account the corpora recently built automatically from the web, which now reach billions of words for English, German, Spanish, etc. As Basque is an under-resourced language, it is logical that we should also turn to this cheap and fast method of collecting corpora. In this paper we present the research we have done to build a large general corpus of Basque from the web. We have tried and evaluated the two methods mentioned in the literature, crawling and using search engines, to determine which best suits Basque in terms of parameters such as speed, cost, size and quality. Our conclusion is that crawling is the method with the potential for building the largest corpora for Basque. Using this method we have built a good-quality corpus of more than 100 million words, and we expect to build a much larger one in the near future.
