Code smells are poor design or implementation choices that hinder code maintenance and reduce software quality. Therefore, code smell detection is important in software development. Recent studies have used machine learning algorithms for code smell detection; however, most of them have focused on Java code smell datasets. This article proposes a Python code smell dataset for the Large Class and Long Method code smells. The dataset contains 1,000 samples for each code smell, with 18 features extracted from the source code. Furthermore, we investigated the detection performance of six machine learning models as baselines for Python code smell detection. The baselines were evaluated using Accuracy and the Matthews correlation coefficient (MCC). Results indicate that the Random Forest ensemble performed best in Python Large Class code smell detection, achieving the highest MCC of 0.77, while the decision tree was the best-performing model in Python Long Method code smell detection, achieving the highest MCC of 0.89.
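For readers unfamiliar with the two evaluation measures named above, the following minimal sketch shows how Accuracy and MCC can be computed from binary predictions (1 = smelly, 0 = not smelly). It is an illustration only, not the evaluation code used in the study; the function names and the toy labels are assumptions introduced here.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = smelly, 0 = clean)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of samples classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in [-1, 1].

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels (hypothetical, not taken from the proposed dataset).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)  # (3, 3, 1, 1)
print("Accuracy:", accuracy(*counts))      # 0.75
print("MCC:", mcc(*counts))                # 0.5
```

Unlike Accuracy, MCC uses all four cells of the confusion matrix, so it is generally less flattering to a classifier that does well on only one class.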