
Automatically Building a Corpus for a Minority Language from the Web



Abstract

We present an approach to language-specific query-based sampling which, given a single document in a target language, finds many more documents in that language by automatically constructing queries to retrieve them from the World Wide Web. We propose a number of methods for building search queries that quickly obtain documents in the target language. These methods are accurate and efficient at building a corpus of Tagalog documents from a single seed document, when such documents make up only 2.5% of the documents in a collection. A simple approach proved very successful: sampling with a query consisting of the most frequent word in the minority-language corpus constructed so far. This method built a corpus whose word frequencies were similar to those of the corpus based on all Tagalog documents in our collection, and it required a relatively small number of search queries. It also quickly achieved good coverage of vocabulary terms. However, adding an element of randomness to the query may give greater coverage, although more queries are required.
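The most-frequent-word strategy described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `search` and `is_target` callables, the in-memory `COLLECTION`, the function-word filter, and the rule of never reusing a query word are all assumptions made for illustration, since the abstract does not specify how documents are retrieved or how their language is checked.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_corpus(seed, search, is_target, max_queries=20):
    """Grow a corpus from one seed document by repeated single-word queries.

    search(query) and is_target(doc) are hypothetical callables supplied by
    the caller; the abstract does not describe either component.
    """
    corpus = [seed]
    seen = {seed}
    used_queries = set()
    freq = Counter(tokenize(seed))
    for _ in range(max_queries):
        # Query with the most frequent corpus word so far.
        # Assumption: a word is never reused as a query.
        query = next(
            (w for w, _ in freq.most_common() if w not in used_queries), None
        )
        if query is None:
            break
        used_queries.add(query)
        for doc in search(query):
            if doc in seen:
                continue
            seen.add(doc)
            if is_target(doc):
                corpus.append(doc)
                freq.update(tokenize(doc))
    return corpus


# Toy stand-in for a web collection (invented example sentences).
COLLECTION = [
    "ang bata ay kumain ng mangga",
    "ang aso ay tumakbo sa bahay",
    "kumain ang pusa ng isda",
    "the cat sat on the mat",
    "ang the mixed page about cats",
    "sa bahay ng lola",
]

# Hypothetical language filter: at least two distinct function words.
FUNCTION_WORDS = {"ang", "ng", "sa", "ay"}


def toy_search(query):
    """Return every document in the toy collection containing the query word."""
    return [d for d in COLLECTION if query in tokenize(d)]


def looks_tagalog(doc):
    """Crude stand-in for a real language identifier."""
    return len(FUNCTION_WORDS & set(tokenize(doc))) >= 2


corpus = build_corpus(COLLECTION[0], toy_search, looks_tagalog)
for doc in corpus:
    print(doc)
```

In this toy run the loop recovers the four documents that pass the filter and never issues a query that reaches the purely English one. The randomized variant mentioned in the abstract could be approximated by sampling the query word instead of always taking the top-ranked one, at the cost of more queries.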
