Understanding and explaining Delta measures for authorship attribution

Evert, Stefan; Proisl, Thomas; Jannidis, Fotis; Reger, Isabella; Pielstroem, Steffen; Schoech, Christof; Vitt, Thorsten

Abstract

This article builds on a mathematical explanation of one of the most prominent stylometric measures, Burrows's Delta (and its variants), to understand and explain how it works. Starting from the conceptual separation between feature selection, feature scaling, and distance measures, we designed a series of controlled experiments in which the kind of feature scaling (various types of standardization and normalization) and the type of distance measure (notably Manhattan, Euclidean, and Cosine) serve as independent variables, and the number of correct authorship attributions serves as the dependent variable indicating the performance of each method. In this way, we are able to describe in some detail how these two variables interact and how they influence the results. We show that feature vector normalization, that is, the transformation of the feature vectors to a uniform length of 1 (implicit in the cosine measure), is the decisive factor behind the recently proposed improvement of Delta. We also show that the information particularly relevant to identifying the author of a text lies in the profile of deviation across the most frequent words, rather than in the extent of the deviation or in the deviation of specific words only.
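The separation between feature scaling and distance measure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names and the toy frequency table are hypothetical, the z-scores use the population standard deviation as a simplifying assumption, and only the Python standard library is used. The sketch standardizes relative word frequencies, then compares texts with Manhattan, Euclidean, and cosine distance, and normalizes vectors to length 1.

```python
import math
from statistics import mean, pstdev

def zscores(freqs):
    """Standardize each word (column) across all texts (rows).
    Assumption: population standard deviation; constant columns map to 0."""
    cols = list(zip(*freqs))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, mus, sds)]
            for row in freqs]

def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(sum(a * a for a in u))

def normalize(u):
    """Scale a vector to Euclidean length 1 (zero vectors are left unchanged)."""
    n = norm(u)
    return [a / n for a in u] if n > 0 else list(u)

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (norm(u) * norm(v))

def burrows_delta(u, v):
    """Mean absolute difference of z-scores, i.e. Manhattan distance / n."""
    return manhattan(u, v) / len(u)

# Hypothetical toy data: relative frequencies of 4 frequent words in 3 texts.
toy = [
    [0.050, 0.030, 0.020, 0.010],
    [0.045, 0.035, 0.015, 0.012],
    [0.060, 0.020, 0.025, 0.008],
]

z = zscores(toy)
for i, j in [(0, 1), (0, 2), (1, 2)]:
    print(i, j,
          round(burrows_delta(z[i], z[j]), 4),
          round(euclidean(z[i], z[j]), 4),
          round(cosine_distance(z[i], z[j]), 4))
```

Once the z-score vectors are normalized to length 1, the squared Euclidean distance between two texts equals exactly twice their cosine distance, which is one way to see why vector normalization is implicit in the cosine measure: only the profile of deviations matters, not their overall magnitude.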

