Exploring the Google Books Corpus: An Information-Theoretic Approach to Linguistic Evolution

Abstract

The Google Books corpus contains millions of books in a variety of languages. Due to this incredible volume and its free availability, it is a treasure trove that has inspired a plethora of linguistic research.

It is tempting to treat frequency trends from Google Books data sets as indicators for the true popularity of various words and phrases. Doing so allows us to draw novel conclusions about the evolution of public perception of a given topic. However, sampling published works by availability and ease of digitization leads to several important effects, which have typically been overlooked in previous studies. One of these is the ability of a single prolific author to noticeably insert new phrases into a language. A greater effect arises from scientific texts, which have become increasingly prolific in the last several decades and are heavily sampled in the corpus. The result is a surge of phrases typical to academic articles but less common in general, such as references to time in the form of citations. We highlight these dynamics by examining and comparing major contributions to the statistical divergence of English data sets between decades in the period 1800-2000. We find that only the English Fiction data set from the second version of the corpus is not heavily affected by professional texts, in clear contrast to the first version of the fiction data set and both unfiltered English data sets.

We critique a method used by authors of an earlier work to determine the birth and death rates of words in a given linguistic data set. While intriguing, the method in question appears to produce an artificial surge in the death rate at the end of the observed period of time. In order to avoid boundary effects in our own analysis of asymmetries in language dynamics, we observe the volume of word flux across various relative frequency thresholds (in both directions) for the second English Fiction data set. We then use the contributions of the words crossing these thresholds to the Jensen-Shannon divergence between consecutive decades to resolve major factors driving the flux.

Having established careful information-theoretic techniques to resolve important features in the evolution of the data set, we validate and refine our methods by analyzing the effects of major exogenous factors, specifically wars. This approach leads to a uniquely comprehensive set of methods for harnessing the Google Books corpus and exploring socio-cultural and linguistic evolution.
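The abstract names two techniques without giving formulas: per-word contributions to the Jensen-Shannon divergence between consecutive decades, and word flux across relative frequency thresholds. A minimal sketch of both follows, assuming the standard equal-weight Jensen-Shannon divergence in base 2; the thesis may use a different weighting or normalization. The function names (`normalize`, `jsd_contributions`, `threshold_flux`) and the toy counts are hypothetical illustrations, not taken from the thesis or from the corpus.

```python
import math


def normalize(counts):
    """Turn raw word counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def jsd_contributions(p, q):
    """Per-word terms of the equal-weight Jensen-Shannon divergence (bits).

    p and q map words to relative frequencies. The divergence itself is
    the sum of the returned values. Assumption: standard definition with
    mixture m = (p + q) / 2.
    """
    contributions = {}
    for word in set(p) | set(q):
        pw = p.get(word, 0.0)
        qw = q.get(word, 0.0)
        m = 0.5 * (pw + qw)
        term = 0.0
        if pw > 0:
            term += 0.5 * pw * math.log2(pw / m)
        if qw > 0:
            term += 0.5 * qw * math.log2(qw / m)
        contributions[word] = term
    return contributions


def threshold_flux(p, q, threshold):
    """Words crossing a relative frequency threshold from p to q.

    Returns (up, down): words that rise to or above the threshold, and
    words that fall below it.
    """
    up = {w for w in q if q[w] >= threshold and p.get(w, 0.0) < threshold}
    down = {w for w in p if p[w] >= threshold and q.get(w, 0.0) < threshold}
    return up, down


# Toy counts for two consecutive "decades" (made up for illustration).
earlier = normalize({"the": 50, "war": 10, "peace": 40})
later = normalize({"the": 50, "war": 40, "peace": 10})

contributions = jsd_contributions(earlier, later)
divergence = sum(contributions.values())
up, down = threshold_flux(earlier, later, threshold=0.25)

# Rank words by how much they drive the divergence between the decades.
ranked = sorted(contributions, key=contributions.get, reverse=True)
```

In this toy case "the" has the same relative frequency in both decades and so contributes nothing, while "war" and "peace" contribute equally and cross the 0.25 threshold in opposite directions. Ranking the crossing words by their contribution is one way to pick out the words that dominate the change between two decades.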

Bibliographic details

  • Author: Pechenick, Eitan
  • Year: 2015
  • Original format: PDF
  • Language: en
