Journal: International Journal of Modern Physics B: Condensed Matter Physics, Statistical Physics, Applied Physics

Long-range correlations and burstiness in written texts: Universal and language-specific aspects

Abstract

Recently, methods from the statistical physics of complex systems have been applied successfully to identify universal features in the long-range correlations (LRCs) of written texts. However, in real texts, these universal features are intermingled with language-specific influences. This paper aims to characterize and better understand the interplay between universal and language-specific effects on the LRCs in texts. To this end, we apply the language-sensitive mapping of written texts to word-length series (wls) and analyse large parallel (same-content) corpora from 10 languages classified into four families (Romanic, Germanic, Greek and Uralic). The autocorrelation functions of the wls reveal tiny but persistent LRCs that decay at large scales following a power law with a language-independent exponent of approximately 0.60-0.65. The impact of language shows in the amplitude of the correlations, where a relative standard deviation of more than 40% among the analysed languages is observed. The classification into language families seems to play a significant role, since the Finnish and Germanic languages exhibit stronger correlations than the Greek and Romanic families. To reveal the origins of the LRCs, we focus on the long words and perform burst and correlation analysis of their positions along the corpora. We find that the universal features are linked more to the correlations of the inter-long-word distances, while the language-specific aspects are related more to their distributions.
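
The pipeline described in the abstract (text to word-length series, autocorrelation of that series, then analysis of the distances between long words) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the tokenization rule, the length threshold that defines a "long" word, and the burstiness coefficient B = (sigma - mu) / (sigma + mu) are all assumptions made for illustration; the abstract does not specify any of them.

```python
import re
from statistics import mean, pstdev


def word_length_series(text):
    """Map a text to its word-length series (wls): one integer per word.

    Assumption: a word is a maximal run of Unicode letters.
    """
    return [len(word) for word in re.findall(r"[^\W\d_]+", text)]


def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag, normalized by the variance."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    m = mean(series)
    variance = sum((x - m) ** 2 for x in series) / n
    if variance == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    covariance = sum(
        (series[i] - m) * (series[i + lag] - m) for i in range(n - lag)
    ) / (n - lag)
    return covariance / variance


def long_word_gaps(series, threshold):
    """Distances (in words) between consecutive long words.

    Assumption: a word is 'long' if its length is >= threshold (hypothetical cutoff).
    """
    positions = [i for i, length in enumerate(series) if length >= threshold]
    return [b - a for a, b in zip(positions, positions[1:])]


def burstiness(gaps):
    """Burstiness coefficient B = (sigma - mu) / (sigma + mu) of the gaps.

    Assumption: this common coefficient stands in for the paper's burst analysis.
    B is -1 for perfectly regular gaps, near 0 for Poisson-like gaps,
    and approaches 1 for very bursty sequences.
    """
    mu = mean(gaps)
    sigma = pstdev(gaps)
    return (sigma - mu) / (sigma + mu)


if __name__ == "__main__":
    sample = "a bb ccc a bb ccc a bb ccc a bb ccc"
    wls = word_length_series(sample)
    print(wls)
    print([round(autocorrelation(wls, k), 3) for k in range(4)])
    print(burstiness(long_word_gaps(wls, threshold=3)))
```

On a real corpus, the autocorrelation would be evaluated over a wide range of lags and its decay fitted on a log-log scale to estimate the power-law exponent. The toy sample here is strictly periodic, so it only demonstrates the mechanics.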