
A language model for statements of software code


Abstract

Building language models for source code enables a wide range of improvements to traditional software engineering tasks. One promising application is automatic code completion. State-of-the-art techniques capture code regularities at the token level using lexical information. Such language models are well suited to predicting short token sequences, but become less effective for long, statement-level predictions. In this paper, we propose PCC to optimize token-level language modeling. Specifically, PCC introduces an intermediate representation (IR) for source code, which puts tokens into groups using lexeme and variable relative order. In this way, PCC is able to handle long token sequences, i.e., group sequences, and to suggest a complete statement with a precise synthesizer. Furthermore, PCC employs a fuzzy matching technique that combines genetic and longest common subsequence algorithms to make the prediction more accurate. We have implemented a code completion plugin for Eclipse and evaluated it on open-source Java projects. The results demonstrate the potential of PCC for generating precise, long, statement-level predictions. In 30%-60% of the cases, it correctly suggests the complete statement with only six candidates, and in 40%-90% of the cases with ten candidates.
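
The abstract does not specify how the IR or the fuzzy matching is defined, so the following is only an illustrative sketch of the two ideas it names: grouping tokens by lexeme and variable relative order, and comparing sequences with a longest common subsequence (LCS). The function names, the tiny keyword set, the `VARn` placeholder scheme, and the similarity formula are all assumptions made for this example, not PCC's actual design, and the genetic-algorithm component is not sketched.

```python
import re

# Hypothetical, deliberately tiny keyword set; a real tool would use a full Java lexer.
JAVA_KEYWORDS = {"int", "for", "if", "else", "return", "new", "while"}
# Identifiers/keywords, integer literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def to_groups(statement):
    """Replace each identifier with a placeholder numbered by first appearance.

    This is one assumed reading of "variable relative order": two statements
    that differ only in variable names map to the same group sequence.
    """
    order = {}
    groups = []
    for tok in TOKEN_RE.findall(statement):
        if re.match(r"[A-Za-z_]", tok) and tok not in JAVA_KEYWORDS:
            order.setdefault(tok, len(order))
            groups.append(f"VAR{order[tok]}")
        else:
            groups.append(tok)  # keywords, literals, punctuation kept as lexemes
    return groups


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed fuzzy-match score in [0, 1]: twice the LCS length over the total length."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


if __name__ == "__main__":
    s1 = to_groups("for (int i = 0; i < n; i++)")
    s2 = to_groups("for (int k = 0; k < len; k++)")
    print(s1 == s2, similarity(s1, s2))
```

Under these assumptions, statements that differ only in identifier names collapse to the same group sequence, which is the kind of regularity a statement-level model could exploit; the LCS score then gives a graded match for sequences that are close but not identical.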
