Journal: Information and Software Technology

Predicting long-time contributors for GitHub projects using machine learning

Abstract

Context: Many organizations develop software systems using open source software (OSS), which is risky because of the high likelihood of losing support. Contributors are critical to the survival of OSS projects, but very few new contributors stay with a project long enough to become long-time contributors (LTCs). Identifying the factors that contribute to becoming an LTC can help OSS project owners use limited resources to retain new contributors.
Objective: In this paper, we investigate whether we can effectively predict which new contributors to OSS repositories will become long-time contributors, based on repository and contributor metadata collected from GitHub.
Method: We construct a dataset containing 70,899 observations from 888 of the most popular repositories, with 56,766 contributors. Each observation represents a contributor who joined a repository and is categorized as either an LTC or a non-LTC, depending on whether their project tenure is longer than 3 years. Each observation has 31 features, calculated from information about the new contributor and the repository at the time the contributor joins the project. We build several machine learning models, including naive Bayes, k-nearest neighbor, logistic regression, decision tree, and random forest, to predict LTCs, and validate them using 10-fold cross-validation. We compare our best model with the state-of-the-art model in terms of precision, recall, F1-score, Matthews correlation coefficient (MCC), and area under the curve (AUC).
Results: In 10-fold cross-validation, the precision, recall, F1-score, MCC, and AUC of our best model (random forest) are 0.695, 0.079, 0.140, 0.226, and 0.913, respectively. These values are, respectively, 27.29%, 92.68%, 86.67%, 56.94%, and 0.55% better than those of the best state-of-the-art baseline model (random forest).
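
The percentages in the Results are most naturally read as relative improvements over the baseline. Assuming that reading (the abstract does not state the formula), the baseline scores they imply can be recovered with a short calculation. The helper name `implied_baseline` is hypothetical, and the values it produces are derived here, not reported in the abstract.

```python
# Scores of the best model in 10-fold cross-validation (from the abstract).
reported = {
    "precision": 0.695,
    "recall": 0.079,
    "f1": 0.140,
    "mcc": 0.226,
    "auc": 0.913,
}

# Improvement over the best baseline, in percent (from the abstract).
improvement_pct = {
    "precision": 27.29,
    "recall": 92.68,
    "f1": 86.67,
    "mcc": 56.94,
    "auc": 0.55,
}


def implied_baseline(score, pct):
    """Invert score = baseline * (1 + pct / 100).

    Assumption: the percentage is a relative improvement over the baseline.
    """
    return score / (1.0 + pct / 100.0)


baseline = {
    name: implied_baseline(reported[name], improvement_pct[name])
    for name in reported
}

for name, value in baseline.items():
    print(f"{name}: implied baseline = {value:.3f}")
```

Under this assumption, the large relative gains in recall and F1-score correspond to small absolute differences, because both the reported and the implied baseline values are low.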
Conclusion: Compared to state-of-the-art models, the models built using our approach use fewer than 50% of the features (31 vs. 63), do not need to wait one month after the contributor joins before predicting future LTC status, and produce better results.
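
The labeling rule and the threshold-based metrics named in the abstract can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. It assumes that tenure is the time between a contributor's first and last contribution and that "3 years" means 1095 days; the abstract defines neither. The dates, predictions, and function names are hypothetical. AUC is omitted because it requires ranked prediction scores rather than hard labels.

```python
import math
from datetime import date

# Assumption: "3 years" approximated as 3 * 365 days.
LTC_THRESHOLD_DAYS = 3 * 365


def is_ltc(first_contribution, last_contribution):
    """Label a contributor as an LTC when the assumed tenure exceeds 3 years."""
    return (last_contribution - first_contribution).days > LTC_THRESHOLD_DAYS


def binary_metrics(y_true, y_pred):
    """Compute precision, recall, F1-score, and MCC from boolean labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


# Hypothetical (first contribution, last contribution) pairs.
contributors = [
    (date(2015, 1, 10), date(2019, 6, 1)),
    (date(2016, 3, 5), date(2016, 3, 20)),
    (date(2014, 7, 1), date(2018, 1, 15)),
    (date(2017, 2, 2), date(2017, 9, 30)),
    (date(2013, 5, 5), date(2017, 5, 5)),
    (date(2018, 1, 1), date(2018, 1, 2)),
]
y_true = [is_ltc(first, last) for first, last in contributors]

# Hypothetical predictions from a conservative classifier that flags few LTCs.
y_pred = [True, False, False, False, False, False]

scores = binary_metrics(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In this toy example the classifier's single positive prediction is correct but it misses two of the three LTCs, so precision is high while recall is low, the same qualitative pattern as the reported scores.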
