2018 IEEE/ACM 40th International Conference on Software Engineering: New Ideas and Emerging Technologies Results

Hierarchical Learning of Cross-Language Mappings Through Distributed Vector Representations for Code

Abstract

Translating a program written in one programming language into another can be useful for software development tasks that need the same functionality implemented in different languages. Although past studies have considered this problem, they may be either specific to the language grammars or specific to certain kinds of code elements (e.g., tokens, phrases, API uses). This paper proposes a new approach to automatically learn cross-language representations for various kinds of structural code elements that may be used for program translation. Our key idea is twofold. First, we normalize and enrich code token streams with additional structural and semantic information, and train cross-language vector representations for the tokens (a.k.a. shared embeddings) based on word2vec, a neural-network-based technique for producing word embeddings. Second, hierarchically from the bottom up, we construct shared embeddings for code elements of higher levels of granularity (e.g., expressions, statements, methods) from the embeddings of their constituents, and then build mappings among code elements across languages based on similarities among embeddings. Our preliminary evaluations on about 40,000 Java and C# source files from 9 software projects show that our approach can automatically learn shared embeddings for various code elements in different languages and identify their cross-language mappings with reasonable Mean Average Precision scores. When compared with an existing tool for mapping library API methods, our approach identifies many more mappings accurately. The mapping results and code can be accessed at https://github.com/bdqnghi/hierarchical-programming-language-mapping. We believe that our idea for learning cross-language vector representations with code structural information can be a useful step towards automated program translation.
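
The bottom-up step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how constituent embeddings are combined, so plain averaging is assumed here, and the token names, the `java:`/`cs:` prefixes, and the vector values are all hypothetical stand-ins for embeddings that would come from word2vec training. The sketch composes element vectors from token vectors, ranks candidate mappings by cosine similarity, and scores the rankings with Mean Average Precision.

```python
import math

# Hypothetical shared token embeddings (made-up values, not trained).
TOKEN_VECS = {
    "java:String":  [0.90, 0.10, 0.00],
    "cs:string":    [0.88, 0.12, 0.02],
    "java:length":  [0.10, 0.90, 0.10],
    "cs:Length":    [0.12, 0.85, 0.10],
    "java:println": [0.00, 0.20, 0.90],
    "cs:WriteLine": [0.05, 0.18, 0.92],
}

def compose(tokens):
    """Build an element embedding by averaging its constituents (assumed rule)."""
    vecs = [TOKEN_VECS[t] for t in tokens]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(len(vecs[0]))]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_candidates(query_vec, candidates):
    """Return candidate names sorted from most to least similar to the query."""
    return sorted(candidates,
                  key=lambda name: cosine(query_vec, candidates[name]),
                  reverse=True)

def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of correct answers."""
    hits, total = 0, 0.0
    for position, name in enumerate(ranked, start=1):
        if name in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(rankings):
    """Mean of average precision over (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in rankings) / len(rankings)

# Hypothetical higher-level elements, each listed as its constituent tokens.
java_elements = {
    "java_len":   ["java:String", "java:length"],
    "java_print": ["java:String", "java:println"],
}
cs_elements = {
    "cs_len":   compose(["cs:string", "cs:Length"]),
    "cs_print": compose(["cs:string", "cs:WriteLine"]),
}
ground_truth = {"java_len": {"cs_len"}, "java_print": {"cs_print"}}

rankings = []
for name, tokens in java_elements.items():
    ranked = rank_candidates(compose(tokens), cs_elements)
    rankings.append((ranked, ground_truth[name]))
    print(name, "->", ranked)

print("MAP:", mean_average_precision(rankings))
```

Averaging keeps the example short. A composition that weights constituents or follows the syntax tree would slot into `compose` without changing the ranking or scoring code.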
