Source: U.S. National Institutes of Health literature, PLoS Clinical Trials

Large-Scale Analysis of Zipf’s Law in English Texts

Abstract

Despite being a paradigm of quantitative linguistics, Zipf’s law for words suffers from three main problems: its formulation is ambiguous, its validity has not been tested rigorously from a statistical point of view, and it has not been confronted with a representatively large number of texts. As a result, the current support for Zipf’s law in texts can be summarized as anecdotal. We try to solve these issues by studying three different versions of Zipf’s law and fitting them to all available English texts in the Project Gutenberg database (consisting of more than 30 000 texts). To do so, we use state-of-the-art tools in fitting and goodness-of-fit tests, carefully tailored to the peculiarities of text statistics. Remarkably, one of the three versions of Zipf’s law, consisting of a pure power-law form in the complementary cumulative distribution function of word frequencies, is able to fit more than 40% of the texts in the database (at the 0.05 significance level), for the whole domain of frequencies (from 1 to the maximum value), and with only one free parameter (the exponent).
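
To make the one-parameter version concrete, the following is a minimal sketch of fitting a pure power law to the complementary cumulative distribution of word frequencies, S(n) = P(frequency >= n) = n^(-alpha) for n >= 1. It is not the authors' procedure. The tokenizer, the function names, the unbounded support (the abstract fits up to the maximum observed frequency), and the golden-section search are all assumptions made for illustration, and no goodness-of-fit test is included.

```python
import math
import random
import re
from collections import Counter


def word_frequency_counts(text):
    """Return one absolute frequency per distinct word (hypothetical tokenizer)."""
    words = re.findall(r"[a-z]+", text.lower())
    return list(Counter(words).values())


def log_likelihood(alpha, freqs):
    """Log-likelihood when S(n) = n**(-alpha), so f(n) = n**(-alpha) - (n+1)**(-alpha).

    Written as -alpha*log(n) + log(1 - (1 + 1/n)**(-alpha)) for numerical stability.
    """
    total = 0.0
    for n in freqs:
        total += -alpha * math.log(n)
        total += math.log(-math.expm1(-alpha * math.log1p(1.0 / n)))
    return total


def fit_exponent(freqs, lo=0.01, hi=10.0, tol=1e-6):
    """Maximize the log-likelihood over alpha in [lo, hi] by golden-section search.

    The log-likelihood is concave in alpha, so a bracketing search is enough.
    """
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = log_likelihood(c, freqs), log_likelihood(d, freqs)
    while b - a > tol:
        if fc > fd:  # maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = log_likelihood(c, freqs)
        else:  # maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = log_likelihood(d, freqs)
    return (a + b) / 2


def sample_power_law(alpha, size, seed=None):
    """Draw synthetic frequencies with P(N >= n) = n**(-alpha) by inverse transform."""
    rng = random.Random(seed)
    # 1 - rng.random() lies in (0, 1], which avoids raising zero to a negative power.
    return [int((1.0 - rng.random()) ** (-1.0 / alpha)) for _ in range(size)]


if __name__ == "__main__":
    # Sanity check on synthetic data with a known exponent.
    synthetic = sample_power_law(alpha=1.0, size=20000, seed=0)
    print("estimated exponent (synthetic):", fit_exponent(synthetic))

    # Toy text; real use would read a full book, for example from Project Gutenberg.
    toy = "the cat and the hat and the bat sat on the mat"
    print("estimated exponent (toy text):", fit_exponent(word_frequency_counts(toy)))
```

A single fitted exponent says nothing about whether the power law is an acceptable description of a given text. The abstract's 40% figure comes from goodness-of-fit tests at the 0.05 significance level, which this sketch does not attempt.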