Venue: Software Clones (IWSC), 2012 6th International Workshop on

Ctcompare: Code clone detection using hashed token sequences

Abstract

There is much research on the use of tokenized source code to find code clones both within and between trees of source code. Some approaches have used suffix trees [1], [3]; others have used variations of longest common substring algorithms [4], [5]. This paper outlines an algorithm, embodied in a new tool called ctcompare, that takes a different tokenization approach. Each code base to be compared is first lexically analysed to produce a sequence of tokens. These are then broken into overlapping tuples of N consecutive tokens. The tuples are then hashed and the hash values of token tuples are used to identify type-1 and type-2 clone pairs. Hashed token sequences combined with a database have already been used in earlier ctcompare versions and elsewhere [2], but with a significant performance penalty due to database insertions. The benefits of this approach over the existing research include the simultaneous comparison of multiple large code bases and fast absolute performance.
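
The pipeline described in the abstract (tokenize, form overlapping tuples of N consecutive tokens, hash each tuple, match equal hashes) can be illustrated with a minimal Python sketch. It is not ctcompare's implementation. The regex tokenizer, the tiny `KEYWORDS` set, the placeholder tokens `ID` and `NUM`, the use of Python's built-in `hash()`, and the function names are all assumptions made for illustration. With `normalize=False` only exact token runs match (roughly type-1 clones); with `normalize=True` identifiers and numeric literals are collapsed to placeholders, so renamed copies also match (roughly type-2 clones). The sketch does not strip comments or handle string literals, which a real lexer would.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny lexer: identifiers, integers, or any single
# non-space character. A real tool would use a full lexer for each language.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
KEYWORDS = frozenset({"int", "for", "if", "else", "while", "return"})


def tokenize(source, normalize=True):
    """Return the token sequence for `source`.

    With normalize=True, identifiers become "ID" and numbers become "NUM",
    so that renamed copies produce the same token sequence.
    """
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if normalize and tok not in KEYWORDS:
            if tok[0].isalpha() or tok[0] == "_":
                tok = "ID"
            elif tok[0].isdigit():
                tok = "NUM"
        tokens.append(tok)
    return tokens


def find_clone_candidates(sources, n=8, normalize=True):
    """Find pairs of positions whose n-token windows are identical.

    `sources` maps a file name to its text. The result is a set of
    ((name_a, pos_a), (name_b, pos_b)) pairs, where pos is a token offset.
    """
    streams = {name: tokenize(text, normalize) for name, text in sources.items()}

    # Hash every overlapping window of n tokens and bucket positions by hash.
    index = defaultdict(list)
    for name, toks in streams.items():
        for pos in range(len(toks) - n + 1):
            index[hash(tuple(toks[pos:pos + n]))].append((name, pos))

    # Equal hashes are only candidates; compare the windows to rule out
    # hash collisions before reporting a pair.
    pairs = set()
    for hits in index.values():
        for i, (name_a, pos_a) in enumerate(hits):
            for name_b, pos_b in hits[i + 1:]:
                window_a = streams[name_a][pos_a:pos_a + n]
                window_b = streams[name_b][pos_b:pos_b + n]
                if window_a == window_b:
                    pairs.add(((name_a, pos_a), (name_b, pos_b)))
    return pairs
```

Calling `find_clone_candidates({"a.c": text_a, "b.c": text_b}, n=8)` on two loops that differ only in variable names reports matching windows, while `normalize=False` does not. Because all code bases share a single in-memory index, several trees can be compared in one pass. The abstract attributes the slowness of earlier versions to database insertions, which this sketch avoids, but it makes no claim about how ctcompare itself stores its hashes.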
