IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)

Two-Stage Attention-Based Model for Code Search with Textual and Structural Features



Abstract

Searching and reusing existing code from a large-scale codebase can greatly improve developers’ programming efficiency. To support code reuse, early code search models leverage information retrieval (IR) techniques to index a large-scale code corpus and return relevant code according to developers’ search queries. However, IR-based models fail to capture the semantics in code and queries. To tackle this issue, researchers have applied deep learning (DL) techniques to code search models. However, these models are either too complex to be applied efficiently or learn the semantic correlation between code and query inadequately.

To bridge the semantic gap between code and query both effectively and efficiently, we propose TabCS (Two-stage Attention-Based model for Code Search). TabCS extracts code and query information from the code textual features (i.e., method name, API sequence, and tokens), the code structural feature (i.e., abstract syntax tree), and the query feature (i.e., tokens). TabCS adopts a two-stage attention network structure. The first stage leverages attention mechanisms to extract semantics from code and query while accounting for their semantic gap. The second stage leverages a co-attention mechanism to capture their semantic correlation and learn better code/query representations.

We evaluate TabCS on two existing large-scale datasets with 485k and 542k code snippets, respectively. Experimental results show that TabCS achieves an MRR of 0.57 on Hu et al.’s dataset, outperforming three state-of-the-art models, CARLCS-CNN, DeepCS, and UNIF, by 18%, 70%, and 12%, respectively. Meanwhile, TabCS achieves an MRR of 0.54 on Husain et al.’s dataset, outperforming CARLCS-CNN, DeepCS, and UNIF by 32%, 76%, and 29%, respectively.
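The two-stage design described above can be illustrated with a short PyTorch sketch. This is not the authors’ implementation: the module names, the embedding dimension, and the concatenation used to fuse the code features are assumptions made for illustration only. Stage one re-weights the tokens of each individual feature (method name, API sequence, code tokens, AST nodes) with a learned attention score; stage two computes a co-attention matrix between the fused code sequence and the query sequence and pools both sides into vectors comparable by cosine similarity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class IntraAttention(nn.Module):
    # Stage 1: learn a weight per token and re-weight the sequence in place.
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                      # x: (batch, seq_len, dim)
        w = F.softmax(self.score(x), dim=1)    # attention weight per token
        return w * x                           # same shape, re-weighted

class CoAttention(nn.Module):
    # Stage 2: correlate every code token with every query token, then pool.
    def __init__(self, dim):
        super().__init__()
        self.U = nn.Parameter(torch.randn(dim, dim) * 0.01)

    def forward(self, code, query):            # (b, Lc, d), (b, Lq, d)
        M = torch.tanh(code @ self.U @ query.transpose(1, 2))  # (b, Lc, Lq)
        a_c = F.softmax(M.max(dim=2).values, dim=1)            # weights over code
        a_q = F.softmax(M.max(dim=1).values, dim=1)            # weights over query
        c = (a_c.unsqueeze(2) * code).sum(dim=1)               # pooled code vec
        q = (a_q.unsqueeze(2) * query).sum(dim=1)              # pooled query vec
        return c, q

# Toy usage: one IntraAttention is shared here for brevity; per-feature
# parameters would be more faithful to a multi-feature encoder.
intra, coatt = IntraAttention(128), CoAttention(128)
name, api, tok, ast = (torch.randn(4, L, 128) for L in (6, 20, 50, 80))
code = torch.cat([intra(name), intra(api), intra(tok), intra(ast)], dim=1)
query = intra(torch.randn(4, 10, 128))
c, q = coatt(code, query)
score = F.cosine_similarity(c, q, dim=1)       # ranking score per pair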
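MRR (mean reciprocal rank), the metric reported above, averages the reciprocal of the rank at which the first correct snippet appears for each query; a query whose answer is never returned contributes 0. A minimal sketch:

def mean_reciprocal_rank(ranks):
    # ranks: 1-based rank of the correct snippet per query, or None if missed.
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Correct snippet ranked 1st, 2nd, and 4th across three queries:
print(mean_reciprocal_rank([1, 2, 4]))  # (1 + 0.5 + 0.25) / 3 ≈ 0.58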
