Knowing the Unseen: Estimating Vocabulary Size over Unseen Samples

Abstract

Empirical studies on corpora involve measuring several quantities in order to compare corpora, create language models, or make generalizations about specific linguistic phenomena in a language. Quantities such as average word length are stable across sample sizes and hence can be reliably estimated from large enough samples. However, quantities such as vocabulary size change with sample size. Measurements based on a given sample therefore need to be extrapolated to obtain estimates over larger unseen samples. In this work, we propose a novel nonparametric estimator of vocabulary size. Our main result establishes the statistical consistency of the estimator, the first of its kind in the literature. Finally, we compare our proposal with state-of-the-art estimators (both parametric and nonparametric) on large standard corpora; apart from showing the favorable performance of our estimator, we also see that the classical Good-Turing estimator consistently underestimates the vocabulary size.
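
The dependence of vocabulary size on sample size can be illustrated with a small sketch. The code below is not the estimator proposed in the paper. It only counts distinct types in growing prefixes of a synthetic token stream and reports the classical Good-Turing estimate of the unseen probability mass (the number of types seen exactly once, divided by the number of tokens). The Zipf-like generator `zipf_sample`, its parameters, and the function names are assumptions introduced for illustration.

```python
import random
from collections import Counter


def vocabulary_size(tokens):
    """Number of distinct types observed in the sample."""
    return len(set(tokens))


def good_turing_missing_mass(tokens):
    """Good-Turing estimate of the total probability of unseen types:
    (number of types seen exactly once) / (number of tokens)."""
    if not tokens:
        raise ValueError("sample must contain at least one token")
    counts = Counter(tokens)
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / len(tokens)


def zipf_sample(n_tokens, n_types=5000, seed=0):
    """Hypothetical corpus: tokens drawn with probability proportional to 1/rank.
    This stands in for a real corpus and is purely an assumption."""
    rng = random.Random(seed)
    weights = [1.0 / rank for rank in range(1, n_types + 1)]
    return rng.choices(range(n_types), weights=weights, k=n_tokens)


if __name__ == "__main__":
    corpus = zipf_sample(20000)
    for n in (1000, 5000, 20000):
        sample = corpus[:n]
        # Vocabulary keeps growing as the prefix grows, while the
        # estimated unseen mass typically shrinks.
        print(n, vocabulary_size(sample), round(good_turing_missing_mass(sample), 3))
```

Because each prefix is contained in the next, the observed vocabulary can only grow with sample size, which is why a measurement on one sample has to be extrapolated rather than reused for a larger one.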