IEEE/ACM International Conference on Mining Software Repositories

Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow



Abstract

For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models requires parallel data between natural language (NL) and code with fine-grained alignments. Stack Overflow (SO) is a promising source for creating such a data set: the questions are diverse, and most of them have corresponding answers with high-quality code snippets. However, existing heuristic methods (e.g., pairing the title of a post with the code in the accepted answer) are limited in both the coverage and the correctness of the NL-code pairs obtained. In this paper, we propose a novel method to mine high-quality aligned data from SO using two sets of features: hand-crafted features that consider the structure of the extracted snippets, and correspondence features obtained by training a probabilistic model, based on neural networks, to capture the correlation between NL and code. These features are fed into a classifier that determines the quality of mined NL-code pairs. Experiments using Python and Java as test beds show that the proposed method greatly improves coverage and accuracy over existing mining methods, even when using only a small number of labeled examples. Further, we find that reasonable results are achieved even when training the classifier on one language and testing on another, showing promise for scaling NL-code mining to a wide variety of programming languages beyond those for which we are able to annotate data.
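The two-feature-set design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Candidate` type, the three structural features, the weights, and the bias are all hypothetical, and a simple token-overlap ratio stands in for the neural correspondence model so that the example runs without training data.

```python
import math
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate pair: the question title (NL intent) and a code snippet."""
    intent: str
    snippet: str
    in_accepted_answer: bool  # hypothetical structural signal


def structural_features(c: Candidate) -> list:
    """Hand-crafted features about the snippet's structure (illustrative only)."""
    lines = c.snippet.strip().splitlines()
    return [
        1.0 if c.in_accepted_answer else 0.0,
        1.0 if len(lines) == 1 else 0.0,                        # single-line snippet
        1.0 if c.snippet.lstrip().startswith(">>>") else 0.0,   # interpreter prompt
    ]


def correspondence_feature(c: Candidate) -> float:
    """Stand-in for a learned NL-code correspondence score.

    Assumption: the abstract describes a neural probabilistic model here;
    this sketch substitutes the fraction of intent words found in the code.
    """
    nl = set(re.findall(r"[a-z]+", c.intent.lower()))
    code = set(re.findall(r"[a-z]+", c.snippet.lower()))
    return len(nl & code) / len(nl) if nl else 0.0


# Hypothetical weights; a real classifier would learn them from labeled pairs.
WEIGHTS = [1.0, 0.5, -1.0, 3.0]
BIAS = -1.5


def pair_quality(c: Candidate) -> float:
    """Logistic score in (0, 1) over the concatenated feature vector."""
    feats = structural_features(c) + [correspondence_feature(c)]
    z = BIAS + sum(w * f for w, f in zip(WEIGHTS, feats))
    return 1.0 / (1.0 + math.exp(-z))


good = Candidate("How to sort a list in reverse order",
                 "sorted(items, reverse=True)", True)
bad = Candidate("How to sort a list in reverse order",
                ">>> import os\n>>> os.getcwd()", False)
```

With these illustrative weights, `pair_quality(good)` scores higher than `pair_quality(bad)`. In a realistic setting the weights would be fit on a small set of annotated pairs, as the abstract indicates, rather than set by hand.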
