Conference: IEEE International Symposium on Software Reliability Engineering Workshops

Multi-label Classification of Commit Messages using Transfer Learning

Abstract

Developers in industry use commit messages to annotate changes made to the code. Accurate classification of these messages can help monitor the software evolution process and enable better tracking for various industrial stakeholders. In this paper, we present a state-of-the-art method for classifying commit messages into categories based on Swanson’s maintenance activities, i.e., “Corrective”, “Perfective”, and “Adaptive”. This is a challenging task because not all commit messages are well written and informative. Existing approaches rely on keyword-based techniques to solve this problem. However, these approaches do not use a full language model and do not recognize the contextual relationships between words. The state-of-the-art methodology in Natural Language Processing (NLP) is to train a context-aware neural network (Transformer) on a very large data set that encompasses the entire language and then fine-tune it for a specific task. In this way, the model can learn the language, pay attention to the context, and then transfer that knowledge for better performance on the specific task. We use an off-the-shelf neural network called DistilBERT and fine-tune it for the commit message classification task. This step is non-trivial because programming languages and commit messages have unique keywords, jargon, and idioms. This paper presents our effort in training this model and constructing the data set for this task. We describe the rules used to construct the data set. We validate our approach on industrial projects from GitHub, such as Kubernetes, Linux, TensorFlow, Spark, TypeScript, and PyTorch. We achieved an 87% F1-score on the commit message classification task, which is an order of magnitude more accurate than previous studies.
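
To make the contrast in the abstract concrete, the following is a minimal sketch of a keyword-based multi-label baseline of the kind the abstract says existing approaches rely on, together with a micro-averaged F1-score. It is not the paper's method or data: the keyword lists, the example messages, the gold labels, and the function names `keyword_labels` and `micro_f1` are all assumptions made for illustration, and the abstract does not say which F1 averaging the authors used.

```python
import re

# Hypothetical keyword lists, chosen for illustration only; they are not the
# rules or the data set described in the paper.
KEYWORDS = {
    "Corrective": {"fix", "bug", "error", "crash", "fail"},
    "Perfective": {"refactor", "cleanup", "rename", "simplify", "optimize"},
    "Adaptive": {"add", "support", "introduce", "implement", "upgrade"},
}


def keyword_labels(message):
    """Return every category whose keyword list matches a word in the message."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    return {label for label, keys in KEYWORDS.items() if words & keys}


def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1 over label sets (one set of labels per message)."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up commit messages with made-up gold labels (multi-label: a message
# may belong to more than one category).
messages = [
    "Fix crash when config file is missing",
    "Refactor scheduler and add support for ARM64",
    "Handle empty input without raising",  # corrective intent, no keyword
]
gold = [{"Corrective"}, {"Perfective", "Adaptive"}, {"Corrective"}]

predicted = [keyword_labels(m) for m in messages]
score = micro_f1(gold, predicted)
print(predicted, round(score, 3))
```

The third message shows the limitation the abstract describes: its intent is corrective, but it contains none of the listed keywords, so the baseline assigns it no label. A context-aware model fine-tuned on labeled commit messages is meant to handle cases like this.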