Knowing the Unseen: Estimating Vocabulary Size over Unseen Samples



Abstract

Empirical studies on corpora involve measuring several quantities in order to compare corpora, create language models, or make generalizations about specific linguistic phenomena in a language. Quantities such as average word length are stable across sample sizes and hence can be reliably estimated from large enough samples. However, quantities such as vocabulary size change with sample size. Thus measurements based on a given sample need to be extrapolated to obtain estimates over larger unseen samples. In this work, we propose a novel nonparametric estimator of vocabulary size. Our main result is to show the statistical consistency of the estimator, the first of its kind in the literature. Finally, we compare our proposal with state-of-the-art estimators (both parametric and nonparametric) on large standard corpora; apart from showing the favorable performance of our estimator, we also see that the classical Good-Turing estimator consistently underestimates the vocabulary size.
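The contrast between stable and sample-size-dependent quantities can be illustrated with a minimal sketch. It is not the estimator proposed in the paper: it only counts distinct word types and mean word length on growing prefixes of a synthetic corpus. The Zipf-like weights, the `w1`, `w2`, ... word forms, and the function names `make_zipf_tokens` and `growth_curve` are all assumptions made for illustration.

```python
import random


def make_zipf_tokens(num_types=2000, num_tokens=10000, seed=0):
    """Draw tokens from a synthetic Zipf-like distribution (illustrative only)."""
    rng = random.Random(seed)
    types = [f"w{i}" for i in range(1, num_types + 1)]
    # Weight of the i-th type is proportional to 1/i (an assumed toy model).
    weights = [1.0 / i for i in range(1, num_types + 1)]
    return rng.choices(types, weights=weights, k=num_tokens)


def growth_curve(tokens, sizes):
    """Return (n, vocabulary size, mean word length) for each prefix length n."""
    rows = []
    for n in sizes:
        prefix = tokens[:n]
        vocab = len(set(prefix))  # number of distinct word types seen so far
        mean_len = sum(len(t) for t in prefix) / len(prefix)
        rows.append((n, vocab, mean_len))
    return rows


tokens = make_zipf_tokens()
curve = growth_curve(tokens, [100, 1000, 10000])
for n, vocab, mean_len in curve:
    print(f"n={n:>6}  types={vocab:>5}  mean word length={mean_len:.2f}")
```

On such data the type count keeps rising as the prefix grows, which is why a vocabulary size measured on one sample has to be extrapolated rather than reused for a larger one.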


Bibliographic record

Source
Joint conference of the annual meeting of the Association for Computational Linguistics; International joint conference on natural language processing of the Asian Federation of Natural Languages Processing; ACL 2009; IJCNLP 2009 | 2009 | pp. 109-117 | 9 pages
Authors
Suma Bhat; Richard Sproat
Original format PDF
CLC classification Programming languages, algorithmic languages

Similar literature

1. Estimating the unseen: A sublinear-sample canonical estimator of distributions [J] . Electronic Colloquium on Computational Complexity . 2010, No. 3
2. Knowing who dunnit: Infants identify the causal agent in an unseen causal interaction [J] . Saxe R, Tzelnic T, Carey S . Developmental Psychology . 2007, No. 1
3. INSPECTRE: Privately Estimating the Unseen [J] . Jayadev Acharya, Gautam Kamath, Ziteng Sun . JMLR: Workshop and Conference Proceedings . 2018, No. 2010
4. Knowing the Unseen: Estimating Vocabulary Size over Unseen Samples [C] . Joint conference of the annual meeting of the Association for Computational Linguistics . 2009
5. Improving IRT parameter estimates with small sample sizes: Evaluating the efficacy of a new data augmentation technique [D] . Foley, Brett Patrick . 2010
6. Estimating the number of unseen variants in the human genome [O] . Iuliana Ionita-Laza, Christoph Lange, Nan M. Laird . 2009
7. Knowing the unseen [O] . Suma Bhat, Richard Sproat . 2009
