Journal: Empirical Software Engineering

Empirical evaluation of tools for hairy requirements engineering tasks



Abstract

Context: A hairy requirements engineering (RE) task involving natural language (NL) documents is (1) a non-algorithmic task to find all relevant answers in a set of documents, that is (2) not inherently difficult for NL-understanding humans on a small scale, but is (3) unmanageable on a large scale. In performing a hairy RE task, humans need more help finding all the relevant answers than they do in recognizing that an answer is irrelevant. Therefore, a hairy RE task demands the assistance of a tool that focuses more on achieving high recall, i.e., finding all relevant answers, than on achieving high precision, i.e., finding only relevant answers. Recall as close to 100% as possible is needed, particularly when the task is applied to the development of a high-dependability system. In this case, a hairy-RE-task tool that falls short of close to 100% recall may even be useless, because to find the missing information, a human has to do the entire task manually anyway. On the other hand, too much imprecision, that is, too many irrelevant answers in the tool's output, means that manually vetting the tool's output to eliminate the irrelevant answers may be too burdensome. The reality is that all that can realistically be expected and validated is that the recall of a hairy-RE-task tool is higher than the recall of a human doing the task manually.

Objective: Therefore, the evaluation of any hairy-RE-task tool must consider the context in which the tool is used, and it must compare the performance of a human applying the tool to do the task with the performance of a human doing the task entirely manually, in the same context. The context of a hairy-RE-task tool includes the characteristics of the documents being subjected to the task and the purposes of subjecting the documents to the task. Traditionally, however, many a hairy-RE-task tool has been evaluated by considering only (1) how high its precision is, or (2) how high its F-measure is, which weights recall and precision equally, ignoring the context and possibly leading to incorrect, often underestimated, conclusions about how effective the tool is.

Method: To evaluate a hairy-RE-task tool, this article offers an empirical procedure that takes into account not only (1) the performance of the tool, but also (2) the context in which the task is being done, (3) the performance of humans doing the task manually, and (4) the performance of those vetting the tool's output. The empirical procedure uses, on one hand, (1) the recall and precision of the tool, (2) a contextually weighted F-measure for the tool, (3) a new measure called the summarization of the tool, and (4) the time required for vetting the tool's output, and, on the other hand, (1) the recall and precision achievable by, and (2) the time required by, a human doing the task manually.

Results: The use of the procedure is shown for a variety of different contexts, including that of successive attempts to improve the recall of an imagined hairy-RE-task tool. The procedure is shown to be context dependent, in that the actual next step of the procedure followed in any context depends on the values that have emerged in previous steps.

Conclusion: Any recommendation for a hairy-RE-task tool to achieve close to 100% recall comes with caveats and may be required only in specific high-dependability contexts. Appendix C applies parts of this procedure, using published data, to several hairy-RE-task tools reported in the literature. The surprising finding is that some of the previously evaluated tools are actually better than they were thought to be when they were evaluated using mainly precision or an unweighted F-measure.
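To make the recall/precision trade-off concrete, the following is a minimal sketch (not taken from the paper) of recall, precision, and the standard F-beta measure, in which beta > 1 weights recall more heavily than precision. The paper's own contextually weighted F-measure and summarization measure are defined in its Method; the beta parameter here is only a hypothetical stand-in for such a recall-favoring weighting.

```python
def recall_precision(relevant: set, retrieved: set) -> tuple:
    """Recall: fraction of the relevant answers that were found.
    Precision: fraction of the found answers that are relevant."""
    true_pos = len(relevant & retrieved)
    recall = true_pos / len(relevant) if relevant else 0.0
    precision = true_pos / len(retrieved) if retrieved else 0.0
    return recall, precision

def f_beta(recall: float, precision: float, beta: float = 1.0) -> float:
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).
    beta = 1 weights recall and precision equally (the unweighted
    F-measure the paper criticizes); beta > 1 favors recall."""
    if recall == 0.0 and precision == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical tool output: 3 relevant answers found, 1 missed,
# plus 3 irrelevant answers that a human would have to vet away.
relevant = {"r1", "r2", "r3", "r4"}
retrieved = {"r1", "r2", "r3", "x1", "x2", "x3"}
r, p = recall_precision(relevant, retrieved)
print(f"recall={r:.2f} precision={p:.2f}")   # recall=0.75 precision=0.50
print(f"F1={f_beta(r, p):.2f} F5={f_beta(r, p, beta=5):.2f}")  # F1=0.60 F5=0.74
```

Note how the recall-weighted score (F5) sits much closer to the tool's recall than the equal-weighted F1 does, which is the direction of weighting a high-dependability context calls for.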
