IEEE Transactions on Software Engineering

BinDiffNN: Learning Distributed Representation of Assembly for Robust Binary Diffing Against Semantic Differences



Abstract

Binary diffing is the process of discovering the differences and similarities in functionality between two binary programs. Previous research approaches binary diffing as a function-matching problem: an initial 1:1 mapping between functions is formulated, and a sequence-matching ratio is then computed to classify two functions as an exact match, a partial match, or no match. Existing techniques are accurate only when detecting exact matches; they are inefficient at detecting partially changed functions, especially those with minor patches. These drawbacks stem from two major challenges: (i) in the 1:1 mapping phase, a strict policy is used to match function features; (ii) in the classification phase, an assembly snippet is treated as ordinary text, and sequence matching is used for similarity comparison. An instruction has a unique structure, i.e., mnemonics and registers occupy specific positions within the instruction and carry semantic relationships, which makes assembly code different from general text. Sequence matching performs well on general text, but it fails to detect structural and semantic changes at the instruction level; using it for classification therefore produces many false results. In this research, we address the aforementioned underlying challenges with a two-fold solution. For the 1:1 mapping phase, we propose computationally inexpensive features, which are compared under distance-based selection criteria to map similar functions and filter out unmatched functions. For the classification phase, we propose a Siamese binary-classification neural network in which each branch is an attention-based distributed-embedding neural network that learns the semantic similarity among assembly instructions and learns to highlight changes at the instruction level; a final-stage fully connected layer learns to accurately classify each 1:1-mapped function pair as either an exact or a partial match.
We used x86 kernel binaries for training and achieved ~99% classification accuracy, which is higher than existing binary diffing techniques and tools.
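The sequence-matching baseline criticized above can be reproduced with Python's standard `difflib`. The snippet below is an illustrative sketch (not the paper's code): a pure register renaming, which leaves the semantics unchanged, still lowers the textual match ratio, which is the kind of false difference the abstract attributes to treating assembly as ordinary text.

```python
from difflib import SequenceMatcher

# Two semantically identical snippets that differ only in register allocation.
f1 = ["mov eax, [ebp+8]", "add eax, 4", "ret"]
f2 = ["mov ecx, [ebp+8]", "add ecx, 4", "ret"]

ratio = SequenceMatcher(None, "\n".join(f1), "\n".join(f2)).ratio()
print(f"sequence match ratio: {ratio:.2f}")  # below 1.0 despite identical semantics
```

A threshold-based classifier built on this ratio would report a partial change here, even though nothing functional changed.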
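The 1:1 mapping phase could be sketched as follows. The feature set (instruction count, call count, basic-block count), the Euclidean distance, and the threshold are all illustrative assumptions standing in for the paper's inexpensive features and distance-based selection criteria.

```python
import math

def distance(a, b):
    # Euclidean distance over a cheap numeric feature vector.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def map_functions(feats_old, feats_new, threshold=3.0):
    """Greedy 1:1 mapping: each old function takes its nearest unmatched
    new function, provided the distance stays under the threshold;
    everything else is filtered out as unmatched."""
    mapping, taken = {}, set()
    for name_a, fa in feats_old.items():
        best = min(
            ((distance(fa, fb), name_b)
             for name_b, fb in feats_new.items() if name_b not in taken),
            default=None,
        )
        if best and best[0] <= threshold:
            mapping[name_a] = best[1]
            taken.add(best[1])
    return mapping

# Hypothetical features: (instruction count, call count, basic-block count)
old = {"f": (40, 2, 5), "g": (12, 0, 2)}
new = {"f'": (42, 2, 5), "h": (90, 7, 14)}
print(map_functions(old, new))  # {'f': "f'"}  (g and h stay unmatched)
```

Only pairs that survive this phase would be passed to the neural classifier.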
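The Siamese classifier itself can be sketched in plain NumPy: two weight-shared branches embed token sequences (a simple mean-pooled embedding here stands in for the paper's attention-based distributed-embedding network), and a logistic layer on the absolute difference of the two embeddings scores the pair. Weights are random, so this shows only the architecture, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 64, 16
E = rng.normal(size=(VOCAB, DIM))   # token-embedding table, shared by both branches
w, b = rng.normal(size=DIM), 0.0    # final logistic classification layer

def branch(tokens):
    # Shared branch: mean-pool token embeddings (stand-in for the
    # attention-based embedding network described in the abstract).
    return E[np.asarray(tokens)].mean(axis=0)

def siamese_score(tokens_a, tokens_b):
    # Element-wise |difference| of the two embeddings, then a sigmoid,
    # giving a probability that the pair is an exact (vs. partial) match.
    d = np.abs(branch(tokens_a) - branch(tokens_b))
    return 1.0 / (1.0 + np.exp(-(d @ w + b)))

score = siamese_score([3, 7, 7, 12], [3, 7, 12])
print(f"match probability: {score:.3f}")
```

Because both branches share `E`, the same instruction sequence always maps to the same embedding, which is the defining property of the Siamese arrangement.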
