Predicting long-time contributors for GitHub projects using machine learning

Eluri Vijaya Kumar; Mazzuchi Thomas A.; Sarkani Shahram

Abstract

Context: Many organizations build software systems on open source software (OSS), which is risky because OSS projects have a high chance of losing support. Contributors are critical to the survival of OSS projects, but very few new contributors stay long enough to become long-time contributors (LTCs). Identifying the factors that lead a contributor to become an LTC can help OSS project owners use limited resources to retain new contributors.

Objective: In this paper, we investigate whether we can effectively predict which new contributors to OSS repositories will become long-time contributors, based on repository and contributor metadata collected from GitHub.

Method: We construct a dataset containing 70,899 observations from the 888 most popular repositories, with 56,766 contributors. Each observation represents a contributor who joined a repository and is categorized as either an LTC or a non-LTC, depending on whether their project tenure is longer than 3 years. Each observation has 31 features, calculated from information about the new contributor and the repository at the time the contributor joins the project. We build several machine learning models, including naive Bayes, k-nearest neighbor, logistic regression, decision tree, and random forest, to predict LTC status, and validate them using 10-fold cross-validation. We compare our best model with the state-of-the-art model in terms of precision, recall, F1-score, Matthews correlation coefficient (MCC), and area under the curve (AUC).

Results: In 10-fold cross-validation, the precision, recall, F1-score, MCC, and AUC of our best model (random forest) are 0.695, 0.079, 0.140, 0.226, and 0.913, respectively. These values are 27.29%, 92.68%, 86.67%, 56.94%, and 0.55% better, respectively, than those of the best state-of-the-art baseline model (also a random forest).

Conclusion: Compared to state-of-the-art models, the models built using our approach use fewer than 50% of the features (31 vs. 63), do not need to wait one month after the contributor joins before predicting future LTC status, and produce better results.
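
The metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names and the confusion-matrix counts are hypothetical, chosen only to show how precision, recall, F1-score, and MCC are computed for the positive (LTC) class. The last lines recompute F1 as the harmonic mean of the precision and recall reported in the abstract.

```python
import math


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, F1 and MCC for the positive (LTC) class.

    tp/fp/fn/tn are confusion-matrix counts. Each metric falls back to 0.0
    when its denominator is zero (a convention chosen for this sketch).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = harmonic_f1(precision, recall)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


def harmonic_f1(precision: float, recall: float) -> float:
    """F1-score as the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical counts, NOT taken from the paper: a heavily imbalanced
# test fold in which only 100 of 1,000 contributors are LTCs.
example = binary_metrics(tp=8, fp=2, fn=92, tn=898)
print(example)

# Precision and recall as reported in the abstract for the random forest.
f1_check = harmonic_f1(0.695, 0.079)
print(round(f1_check, 3))
```

With the reported precision and recall, the harmonic mean comes out at about 0.142, close to the reported F1-score of 0.140. A small gap of this kind would be expected if the reported figures are averages over the 10 folds, although the abstract does not say how they were aggregated.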

Bibliographic details

Source
Information and Software Technology, 2021, Issue 10, pp. 106616.1-106616.13 (13 pages)
Authors
Eluri Vijaya Kumar; Mazzuchi Thomas A.; Sarkani Shahram;
Author affiliations

George Washington University, Systems Engineering, Washington, DC 20052, USA;

George Washington University, School of Engineering & Applied Science, Systems Engineering & Engineering Management, Washington, DC 20052, USA;

George Washington University, School of Engineering & Applied Science, Systems Engineering & Engineering Management, Washington, DC 20052, USA;

Original format: PDF
Language: English
Keywords
Long-time contributor; GitHub; GHTorrent; BigQuery; Machine learning models;

Similar literature

1. Using software metrics for predicting vulnerable classes and methods in Java projects: A machine learning approach [J]. Kazi Zakia Sultana, Vaibhav Anu, Tai-Yin Chong. Journal of Software Maintenance and Evolution: Research and Practice, 2021, Issue 3.
2. An Empirical Comparison of Machine Learning Techniques in Predicting the Bug Severity of Open and Closed Source Projects [J]. K. K. Chaturvedi, V. B. Singh. International Journal of Open Source Software & Processes, 2012, Issue 2.
3. Application of machine learning in predicting construction project profit in Ghana using Support Vector Regression Algorithm (SVRA) [J]. Adinyira Emmanuel, Adjei Emmanuel Akoi-Gyebi, Agyekum Kofi. Engineering, Construction and Architectural Management, 2021, Issue 5.
4. Using Dynamic and Contextual Features to Predict Issue Lifetime in GitHub Projects [C]. Riivo Kikas, Marlon Dumas, Dietmar Pfahl. Working Conference on Mining Software Repositories, 2016.
5. Reliability Improvement on Feasibility Study for Selection of Infrastructure Projects Using Data Mining and Machine Learning [D]. Hu, Xi. 2020.
6. Comparison of machine learning techniques to predict all-cause mortality using fitness data: the Henry Ford exercIse testing (FIT) project [O]. Sherif Sakr, Radwa Elshawi, Amjad M. Ahmed. 2017.
7. Developing a machine learning model to predict the construction duration of tall building projects [O]. Muizz O. Sanni-Anibire, Rosli Mohamad Zin, Sunday Olusanya Olatunji. 2021.
