Software Quality Journal

An improved text classification modelling approach to identify security messages in heterogeneous projects


Abstract

Security remains under-addressed in many organisations, as illustrated by the number of large-scale software security breaches. Preventing breaches can begin during software development if attention is paid to security during the software's design and implementation. One approach to security assurance during software development is to examine communications between developers as a means of studying the security concerns of the project. Prior research has investigated models for classifying project communication messages (e.g., issues or commits) as security related or not. A known problem is that these models are project-specific, limiting their use by other projects or organisations. We investigate whether we can build a generic classification model that can generalise across projects. We define a set of security keywords by extracting them from relevant security sources, dividing them into four categories: asset, attack/threat, control/mitigation, and implicit. Using different combinations of these categories and including them in the training dataset, we built a classification model and evaluated it on industrial, open-source, and research-based datasets containing over 45 different products. Our model based on harvested security keywords as a feature set shows average recall from 55 to 86%, minimum recall from 43 to 71% and maximum recall from 60 to 100%. It also shows an average f-score between 3.4 and 88%, an average g-measure of at least 66% across all the datasets, and an average AUC of ROC from 69 to 89%. In addition, models that use externally sourced features outperformed models that use project-specific features on average by a margin of 26-44% in recall, 22-50% in g-measure, 0.4-28% in f-score, and 15-19% in AUC of ROC. Further, our results outperform a state-of-the-art prediction model for security bug reports in all cases.
Using sound statistical and effect size tests, we find that (1) using harvested security keywords as features to train a text classification model significantly improves classification models and their generalisation to other projects; (2) including features in the training dataset before model construction significantly improves classification models; and (3) different security categories represent predictors for different projects. Finally, we introduce new and promising approaches to construct models that can generalise across different independent projects.
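
As a rough illustration of the keyword-category idea described above, the following minimal sketch turns a project message into per-category keyword counts that could serve as classifier features. It is not the authors' implementation. The abstract names the four categories but not their contents, so the keyword lists, the `tokenize` helper, and the `keyword_features` function are all hypothetical, and simple counting is an assumed feature encoding.

```python
import re
from collections import Counter

# Hypothetical keyword lists. The four category names come from the abstract;
# the words inside each category are invented for illustration only.
SECURITY_KEYWORDS = {
    "asset": {"password", "credential", "token"},
    "attack_threat": {"injection", "overflow", "exploit"},
    "control_mitigation": {"encrypt", "sanitize", "authenticate"},
    "implicit": {"crash", "leak", "corrupt"},
}


def tokenize(message):
    """Lower-case the message and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", message.lower())


def keyword_features(message, categories):
    """Count exact keyword matches per selected category.

    `categories` selects which keyword categories to use, mirroring the
    abstract's "different combinations of these categories".
    """
    counts = Counter(tokenize(message))
    return {
        category: sum(counts[word] for word in SECURITY_KEYWORDS[category])
        for category in categories
    }


features = keyword_features(
    "Fix SQL injection that exposes the password token",
    ["asset", "attack_threat"],
)
print(features)
```

Because the keyword lists come from outside any single project, the same feature extractor can be applied unchanged to messages from a different project, which is the property the abstract attributes to externally sourced features.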
