This distribution is counter-intuitive for at least two reasons. First, it would seem "obvious" that numbers drawn from a list generated by widely different arbitrary processes would have roughly equal probabilities of having 1 or 9 as their first digit. This is not normally the case. If the list of numbers has no artificial limits and does not include invented numbers such as postal codes, then approximately 30% of the numbers will have 1 as their first digit, but only about 5% will have 9 as their first digit. Deviations from the expected Benford distribution indicate the presence of some special characteristic of the data. The second, more theoretically challenging, problem is: what underlying property, shared by so many widely different processes, generates lists of numbers that follow Benford's Law?
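The 30% and 5% figures quoted above match the standard statement of Benford's Law, P(d) = log10(1 + 1/d) for a leading digit d from 1 to 9. The abstract does not spell this formula out, so the following is only a minimal sketch for illustration; the function name `benford_probabilities` is an assumption.

```python
import math


def benford_probabilities():
    """Return {d: P(d)} for leading digits 1..9 under the standard Benford's Law."""
    # P(d) = log10(1 + 1/d); the nine probabilities sum to 1.
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


if __name__ == "__main__":
    for digit, p in benford_probabilities().items():
        print(f"first digit {digit}: {p:.3f}")
```

Under this formula the digit 1 leads about 30.1% of the time and the digit 9 about 4.6% of the time, which is consistent with the rounded figures in the text.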
We conducted an empirical investigation to determine under what circumstances various software metrics follow Benford's Law, and whether any special characteristics, or irregularities, in the data can be uncovered when the data are found not to follow the law. The trickier problem of understanding why the lists of metrics might follow Benford's Law is left to another study.
Lists were formed from three software metrics extracted from 100 public domain industrial Java projects. These metrics were Lines of Code (LOC), Fan-Out (FO) and McCabe Cyclomatic Complexity (MCC). Given that a Benford's Law analysis requires a list of considerable length, the data were divided into two groups. The first group was drawn from projects containing more than 100 files. This was intended as the "control group" and was expected to follow Benford's Law if that law is applicable to the analysis of software engineering metrics. To study the sensitivity of the digital analysis technique to project size, projects with a smaller number of files were compared to the control group.
The empirical results indicate that the first digits of numbers in lists of LOC metrics extracted from the projects followed the probabilities predicted by Benford's Law more closely than the "equal probability of occurrence" suggested by intuitive reasoning. This was shown using both qualitative and quantitative measures. The FO and MCC metrics did not follow the standard Benford's Law as well as the LOC metrics did. This is because the FO and MCC lists contain a significant number of values less than 10 and follow a different first-digit distribution. Further investigation of the digital analysis technique is necessary to evaluate the applicability of Benford's Law in the total context of software metrics.
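The kind of first-digit comparison described above can be sketched as follows. The abstract does not name the quantitative measures the study used, so the Pearson chi-square statistic here is an assumption chosen for illustration, as are the helper names `first_digit`, `first_digit_counts` and `chi_square_vs_benford`.

```python
import math
from collections import Counter


def first_digit(n):
    """Return the leading decimal digit of a positive integer."""
    while n >= 10:
        n //= 10
    return n


def first_digit_counts(values):
    """Count leading digits 1..9, skipping zero and negative values."""
    counts = Counter(first_digit(v) for v in values if v > 0)
    return [counts.get(d, 0) for d in range(1, 10)]


def chi_square_vs_benford(values):
    """Pearson chi-square statistic of observed first digits against Benford's Law.

    Assumption: chi-square is used only as an example; the study's own
    quantitative measure is not specified in the abstract.
    """
    observed = first_digit_counts(values)
    total = sum(observed)
    if total == 0:
        raise ValueError("no positive values to analyse")
    statistic = 0.0
    for d, obs in zip(range(1, 10), observed):
        expected = total * math.log10(1 + 1 / d)
        statistic += (obs - expected) ** 2 / expected
    return statistic


if __name__ == "__main__":
    # Hypothetical data, not taken from the study.
    wide_range = [2 ** k for k in range(200)]   # spans many orders of magnitude
    small_values = list(range(1, 10)) * 22      # every value is below 10
    print(chi_square_vs_benford(wide_range))
    print(chi_square_vs_benford(small_values))
```

A smaller statistic means closer agreement with Benford's Law. For a value below 10 the first digit is the value itself, so a list dominated by such values, as reported for FO and MCC, reflects the distribution of the metric directly rather than a spread across orders of magnitude.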