Conference: European Symposium on Research in Computer Security

Source Code Authorship Attribution Using Long Short-Term Memory Based Networks

Abstract

Machine learning approaches to source code authorship attribution attempt to find statistical regularities in human-generated source code that can identify the author or authors of that code. This has applications in plagiarism detection, intellectual property infringement, and post-incident forensics in computer security. The introduction of features derived from the Abstract Syntax Tree (AST) of source code has recently set new benchmarks in this area, significantly improving over previous work that relied on easily obfuscatable lexical and format features of program source code. However, these AST-based approaches rely on hand-constructed features derived from such trees, and often include ancillary information such as function and variable names that may be obfuscated or manipulated. In this work, we provide novel contributions to AST-based source code authorship attribution using deep neural networks. We implement Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (BiLSTM) models to automatically extract relevant features from the AST representation of programmers' source code. We show that our models can automatically learn efficient representations of AST-based features without needing hand-constructed ancillary information used by previous methods. Our empirical study on multiple datasets with different programming languages shows that our proposed approach achieves the state-of-the-art performance for source code authorship attribution on AST-based features, despite not leveraging information that was previously thought to be required for high-confidence classification.
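To illustrate the idea of using AST structure without identifier names, here is a minimal sketch. It is not the authors' pipeline: the abstract does not specify the parser, traversal order, or encoding, so the use of Python's standard `ast` module, the depth-first pre-order traversal, and the helper names `node_type_sequence` and `encode` are all assumptions made for illustration. The sketch turns source code into a sequence of node-type ids of the kind a sequence model such as an LSTM or BiLSTM could consume.

```python
import ast


def node_type_sequence(source: str) -> list[str]:
    """Return AST node type names in depth-first pre-order.

    Only node types are kept; identifiers and literal values are dropped.
    (Hypothetical helper, not taken from the paper.)
    """
    tree = ast.parse(source)
    sequence = []
    stack = [tree]
    while stack:
        node = stack.pop()
        sequence.append(type(node).__name__)
        # Push children in reverse so they are visited left to right.
        stack.extend(reversed(list(ast.iter_child_nodes(node))))
    return sequence


def encode(sequence: list[str], vocab: dict[str, int]) -> list[int]:
    """Map node type names to integer ids, growing the vocabulary as needed.

    Id 0 is left unused so it can serve as a padding value.
    (Hypothetical helper, not taken from the paper.)
    """
    return [vocab.setdefault(name, len(vocab) + 1) for name in sequence]


# Two functions that differ only in their function and variable names.
seq_a = node_type_sequence("def add(x, y):\n    return x + y\n")
seq_b = node_type_sequence("def plus(left, right):\n    return left + right\n")

vocab: dict[str, int] = {}
ids_a = encode(seq_a, vocab)
ids_b = encode(seq_b, vocab)
```

Because names are discarded, `seq_a` and `seq_b` are identical here, which is the sense in which such a representation is unaffected by renaming functions and variables. Feeding the integer sequences to an embedding layer followed by a recurrent network would be one possible next step, again as an assumption rather than a description of the paper's models.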
