IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)

Two-Stage Attention-Based Model for Code Search with Textual and Structural Features



Abstract

Searching and reusing existing code from a large-scale codebase can greatly improve developers’ programming efficiency. To support code reuse, early code search models leverage information retrieval (IR) techniques to index a large-scale code corpus and return relevant code according to developers’ search queries. However, IR-based models fail to capture the semantics in code and queries. To tackle this issue, researchers have applied deep learning (DL) techniques to code search models. However, these models are either too complex to be applied efficiently or learn the semantic correlation between code and query inadequately.

To bridge the semantic gap between code and query both effectively and efficiently, we propose TabCS (Two-stage Attention-Based model for Code Search). TabCS extracts code and query information from the code textual features (i.e., method name, API sequence, and tokens), the code structural feature (i.e., abstract syntax tree), and the query feature (i.e., tokens). TabCS adopts a two-stage attention network structure. The first stage leverages attention mechanisms to extract semantics from code and query while accounting for their semantic gap. The second stage leverages a co-attention mechanism to capture their semantic correlation and learn better code/query representations.

We evaluate TabCS on two existing large-scale datasets with 485k and 542k code snippets, respectively. Experimental results show that TabCS achieves an MRR of 0.57 on Hu et al.’s dataset, outperforming three state-of-the-art models, CARLCS-CNN, DeepCS, and UNIF, by 18%, 70%, and 12%, respectively. Meanwhile, TabCS achieves an MRR of 0.54 on Husain et al.’s dataset, outperforming CARLCS-CNN, DeepCS, and UNIF by 32%, 76%, and 29%, respectively.
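The two-stage design described above can be illustrated with a short PyTorch sketch. This is not the authors’ implementation: the module names, the embedding dimension, and the concatenation used to fuse the code features are assumptions made for illustration only. Stage one re-weights the tokens of each individual feature (method name, API sequence, code tokens, AST nodes) with a learned attention score; stage two computes a co-attention matrix between the fused code sequence and the query sequence and pools both sides into vectors comparable by cosine similarity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class IntraAttention(nn.Module):
    # Stage 1: learn a weight per token and re-weight the sequence in place.
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                      # x: (batch, seq_len, dim)
        w = F.softmax(self.score(x), dim=1)    # attention weight per token
        return w * x                           # same shape, re-weighted

class CoAttention(nn.Module):
    # Stage 2: correlate every code token with every query token, then pool.
    def __init__(self, dim):
        super().__init__()
        self.U = nn.Parameter(torch.randn(dim, dim) * 0.01)

    def forward(self, code, query):            # (b, Lc, d), (b, Lq, d)
        M = torch.tanh(code @ self.U @ query.transpose(1, 2))  # (b, Lc, Lq)
        a_c = F.softmax(M.max(dim=2).values, dim=1)            # weights over code
        a_q = F.softmax(M.max(dim=1).values, dim=1)            # weights over query
        c = (a_c.unsqueeze(2) * code).sum(dim=1)               # pooled code vec
        q = (a_q.unsqueeze(2) * query).sum(dim=1)              # pooled query vec
        return c, q

# Toy usage: one IntraAttention is shared here for brevity; per-feature
# parameters would be more faithful to a multi-feature encoder.
intra, coatt = IntraAttention(128), CoAttention(128)
name, api, tok, ast = (torch.randn(4, L, 128) for L in (6, 20, 50, 80))
code = torch.cat([intra(name), intra(api), intra(tok), intra(ast)], dim=1)
query = intra(torch.randn(4, 10, 128))
c, q = coatt(code, query)
score = F.cosine_similarity(c, q, dim=1)       # ranking score per pair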
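MRR (mean reciprocal rank), the metric reported above, averages the reciprocal of the rank at which the first correct snippet appears for each query; a query whose answer is never returned contributes 0. A minimal sketch:

def mean_reciprocal_rank(ranks):
    # ranks: 1-based rank of the correct snippet per query, or None if missed.
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Correct snippet ranked 1st, 2nd, and 4th across three queries:
print(mean_reciprocal_rank([1, 2, 4]))  # (1 + 0.5 + 0.25) / 3 ≈ 0.58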
