Alternative ways to estimate change points in multinomial sequences. An application to an authorship attribution problem

Abstract

The statistical analysis of literary style is the part of stylometry that compares measurable characteristics in a text that are rarely controlled by the author with those in other texts. When the goal is to settle authorship questions, these characteristics should relate to the author's style and not to the genre, epoch or editor, and their variation between authors should be larger than their variation within comparable texts from the same author.

For an overview of the literature on stylometry and some of the techniques involved, see for example Mosteller and Wallace (1964, 82), Herdan (1964), Morton (1978), Holmes (1985), Oakes (1998) or Lebart, Salem and Berry (1998).

Tirant lo Blanc, a chivalry book, is the main work in Catalan literature, and it was hailed as "the best book of its kind in the world" by Cervantes in Don Quixote. Considered by writers like Vargas Llosa or Damaso Alonso to be the first modern novel in Europe, it has been translated several times into Spanish, Italian and French, with modern English translations by Rosenthal (1996) and La Fontaine (1993). The main body of this book was written between 1460 and 1465, but it was not printed until 1490.

There has been an intense and long-lasting debate about its authorship, which stems from its first edition: the introduction states that the whole book is the work of Martorell (1413?-1468), while the end states that the last quarter of the book was written by Galba (?-1490) after the death of Martorell. Some of the authors that support the theory of single authorship are Riquer (1990), Chiner (1993) and Badia (1993), while some of those supporting double authorship are Riquer (1947), Coromines (1956) and Ferrando (1995). For an overview of this debate, see Riquer (1990).

Neither of the two candidate authors left any text comparable to the one under study, and therefore discriminant analysis cannot be used to help classify chapters by author. By using sample texts encompassing about ten percent of the book, and looking at word length and at the use of 44 conjunctions, prepositions and articles, Ginebra and Cabos (1998) detect heterogeneities that might indicate the existence of two authors. By analyzing the diversity of the vocabulary, Riba and Ginebra (2000) estimate that stylistic boundary to be near chapter 383.

Following the lead of the extensive literature, this paper looks into word length, the use of the most frequent words and the use of vowels in each chapter of the book. Given that the features selected are categorical, this leads to three contingency tables of ordered rows and therefore to three sequences of multinomial observations.

Section 2 explores these sequences graphically, observing a clear shift in their distribution. Section 3 describes the problem of estimating a sudden change-point in those sequences. In the following sections we propose various ways to estimate change-points in multinomial sequences: the method in Section 4 involves fitting models for polytomous data; the one in Section 5 fits gamma models onto the sequence of chi-square distances between each row profile and the average profile; the one in Section 6 fits models onto the sequence of values taken by the first component of the correspondence analysis, as well as onto sequences of other summary measures like the average word length. In Section 7 we fit models onto the marginal binomial sequences to identify the features that distinguish the chapters before and after that boundary. Most methods rely heavily on the use of generalized linear models.
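
The change-point idea summarized above can be illustrated with a minimal sketch. It is not the authors' method: the abstract says the paper relies on generalized linear models, whereas this example assumes the simplest possible setting, a single abrupt change with one multinomial probability vector before the boundary and another after it, estimated by maximizing the profile log-likelihood over all candidate boundaries. The function names (`segment_loglik`, `estimate_change_point`, `chi_square_distances`) and the toy count table are hypothetical and introduced only for illustration. The second function computes the chi-square distance between each row profile and the average profile, the summary sequence mentioned for Section 5.

```python
import numpy as np


def segment_loglik(counts):
    """Maximized multinomial log-likelihood of a block of rows that share one
    probability vector. The multinomial coefficients are dropped because they
    do not depend on where the boundary is placed."""
    totals = counts.sum(axis=0).astype(float)
    n = totals.sum()
    nonzero = totals > 0  # skip empty categories to avoid log(0)
    return float(np.sum(totals[nonzero] * np.log(totals[nonzero] / n)))


def estimate_change_point(counts):
    """Return (k, profile): k is the number of rows assigned to the first
    segment, and profile maps every candidate k to its log-likelihood."""
    counts = np.asarray(counts)
    n_rows = counts.shape[0]
    profile = {}
    for k in range(1, n_rows):
        profile[k] = segment_loglik(counts[:k]) + segment_loglik(counts[k:])
    best_k = max(profile, key=profile.get)
    return best_k, profile


def chi_square_distances(counts):
    """Chi-square distance between each row profile and the average profile.
    Assumes every category has a positive total count."""
    counts = np.asarray(counts, dtype=float)
    row_profiles = counts / counts.sum(axis=1, keepdims=True)
    average_profile = counts.sum(axis=0) / counts.sum()
    return ((row_profiles - average_profile) ** 2 / average_profile).sum(axis=1)


# Toy table (made-up numbers): rows are "chapters", columns are categories
# such as word-length classes. The first six rows share one profile and the
# last four share another.
toy_counts = np.array([[50, 30, 20]] * 6 + [[20, 30, 50]] * 4)
boundary, loglik_profile = estimate_change_point(toy_counts)
distances = chi_square_distances(toy_counts)
```

On real chapter counts the profile log-likelihood would normally be plotted against the candidate boundary, since a flat profile around the maximum signals an uncertain estimate.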

Bibliographic details

  • Authors

    Riba Alex; Ginebra Josep;

  • Year: 2003
  • Original format: PDF
  • Language: eng
