Journal: Empirical Software Engineering

What kind of questions do developers ask on Stack Overflow? A comparison of automated approaches to classify posts into question categories


Abstract

On question-and-answer sites such as Stack Overflow (SO), developers use tags to label the content of a post and to support other developers in searching and browsing questions. However, these tags mainly refer to technological aspects rather than the purpose of the question. Tagging questions with their purpose can add a new dimension to the identification of topics discussed in SO posts. In this paper, we aim to automate the classification of SO question posts into seven question categories. As a first step, we harmonized existing taxonomies of question categories and then manually classified 1,000 SO questions according to our new taxonomy. In addition to the question category, we marked the phrases that indicate a question category in each post. We then used this data set to automate the classification of posts using two approaches. For the first approach, we manually analyzed the marked phrases to find patterns and derived regular expressions from them. Based on these regular expressions, we implemented a classifier for each category that determines whether a post belongs to that category. In the second approach, we used the curated data set to train classification models with supervised machine learning algorithms (Random Forest and Support Vector Machines). For the machine learning algorithms, we experimented with 1,312 different configurations of text preprocessing and input data representation. We then compared the performance of the regex approach with that of the best machine learning configuration on a validation set of 110 posts. The results show that the regular expression approach classifies posts into the correct question category with an average precision and recall of 0.90 and an MCC of 0.68.
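To make the regex approach concrete, the following is a minimal Python sketch of one binary classifier per category, together with the Matthews correlation coefficient (MCC) used as an evaluation metric. The patterns are hypothetical illustrations, not the regular expressions from the paper, and only the three category names that appear in this abstract are used; the names `CATEGORY_PATTERNS`, `classify`, and `mcc` are assumptions made for this example.

```python
import re
from math import sqrt

# Hypothetical patterns for illustration only; the paper's actual regular
# expressions are not given in the abstract.
CATEGORY_PATTERNS = {
    "API USAGE": re.compile(r"\bhow (?:do|can|to)\b", re.IGNORECASE),
    "CONCEPTUAL": re.compile(
        r"\bwhat is the difference\b|\bis it possible\b", re.IGNORECASE
    ),
    "DISCREPANCY": re.compile(
        r"\b(?:does not|doesn't|not) work(?:ing)?\b|\bunexpected\b", re.IGNORECASE
    ),
}


def classify(post):
    """Run each per-category classifier; a post may receive several categories."""
    return {
        name for name, pattern in CATEGORY_PATTERNS.items() if pattern.search(post)
    }


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for one binary (per-category) classifier."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denominator == 0 else (tp * tn - fp * fn) / denominator
```

Because each category has its own independent pattern, a single post can match more than one category, which is what makes a later co-occurrence analysis possible.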
We also applied the regex approach to all SO questions that deal with Android app development and investigated the co-occurrence of question categories in posts. We found that API USAGE, CONCEPTUAL, and DISCREPANCY are the most frequently assigned question categories and that they also frequently occur together. Our approach can support developers in browsing SO discussions and researchers in building recommender systems based on SO.
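The co-occurrence analysis mentioned above can be sketched as counting how often each pair of categories is assigned to the same post. This is an assumed, simplified formulation rather than the authors' procedure; the function name `co_occurrence` and the sample input are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(labelled_posts):
    """Count, for every pair of categories, the posts assigned both of them.

    `labelled_posts` is an iterable of category sets, one set per post.
    """
    counts = Counter()
    for categories in labelled_posts:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(categories), 2):
            counts[pair] += 1
    return counts


# Hypothetical sample: three posts with their assigned categories.
sample = [
    {"API USAGE", "CONCEPTUAL"},
    {"API USAGE", "CONCEPTUAL", "DISCREPANCY"},
    {"DISCREPANCY"},
]
pair_counts = co_occurrence(sample)
```

Posts with a single category contribute no pairs, so the counts reflect only posts where categories overlap.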
