Journal: Empirical Software Engineering

Siamese: scalable and incremental code clone search via multiple code representations


Abstract

This paper presents a novel code clone search technique that is accurate, incremental, and scalable to hundreds of millions of lines of code. Our technique incorporates multiple code representations (i.e., a technique to transform code into various representations to capture different types of clones), query reduction (i.e., a technique to select clone search keywords based on their uniqueness), and a customised ranking function (i.e., a technique to allow a specific clone type to be ranked on top of the search results) to improve clone search performance. We implemented the technique in a clone search tool, called Siamese, and evaluated its search accuracy and scalability on three established clone data sets. Siamese offers the highest mean average precision, 95% and 99%, on two clone benchmarks compared to seven state-of-the-art clone detection tools, and reports the largest number of Type-3 clones compared to three other code search engines. Siamese is scalable and can return cloned code snippets within 8 seconds for a code corpus of 365 million lines of code. Using an index of 130,719 GitHub projects, we demonstrate that Siamese's incremental indexing capability dramatically decreases the index preparation time for large-scale data sets with multiple releases of software projects. The paper discusses the applications of Siamese to facilitate software development and research with two use cases: online code clone detection and clone search with automated license analysis.
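The three ingredients named in the abstract (multiple code representations, query reduction, and a customised ranking function) can be illustrated with a toy in-memory sketch. This is not Siamese's implementation: the abstract does not give its details, so everything here is an assumption made for illustration, including the two representations (raw tokens and identifier-normalised token trigrams), the document-frequency cut-off `max_ratio`, the per-representation `weights`, and all function names.

```python
import re
from collections import Counter

KEYWORDS = {"int", "return", "if", "else", "for", "while"}  # tiny illustrative subset
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def tokenize(code):
    """Split source text into identifiers/keywords, integer literals and single symbols."""
    return TOKEN_RE.findall(code)

def normalise(tokens):
    """Replace identifiers with ID and numbers with NUM; keep keywords and symbols."""
    out = []
    for tok in tokens:
        if tok in KEYWORDS:
            out.append(tok)
        elif tok.isdigit():
            out.append("NUM")
        elif tok[0].isalpha() or tok[0] == "_":
            out.append("ID")
        else:
            out.append(tok)
    return out

def ngrams(tokens, n=3):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def representations(code):
    """Two hypothetical representations: exact tokens and normalised trigrams."""
    toks = tokenize(code)
    return {"original": set(toks), "normalised": set(ngrams(normalise(toks)))}

def build_index(corpus):
    """Return per-document term sets and per-representation document frequencies."""
    docs = {name: representations(code) for name, code in corpus.items()}
    doc_freq = {"original": Counter(), "normalised": Counter()}
    for reps in docs.values():
        for rep_name, terms in reps.items():
            doc_freq[rep_name].update(terms)  # each term counted once per document
    return docs, doc_freq

def reduce_query(terms, freq, n_docs, max_ratio=0.75):
    """Query reduction: drop terms that occur in too many documents to be distinctive."""
    return {t for t in terms if freq[t] / n_docs <= max_ratio}

def search(query_code, docs, doc_freq, weights):
    """Rank documents by weighted term overlap, summed over all representations."""
    n_docs = len(docs)
    query = representations(query_code)
    scores = {}
    for name, reps in docs.items():
        score = 0.0
        for rep_name, terms in query.items():
            kept = reduce_query(terms, doc_freq[rep_name], n_docs)
            score += weights[rep_name] * len(kept & reps[rep_name])
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)

corpus = {
    "sum_a": "int sum(int a, int b) { return a + b; }",
    "sum_b": "int add(int x, int y) { return x + y; }",
    "loop": "for (i = 0; i < 10; i++) { total = total + i; }",
}
docs, doc_freq = build_index(corpus)
# Weighting the normalised representation higher favours clones with renamed identifiers.
ranking = search("int plus(int p, int q) { return p + q; }", docs, doc_freq,
                 weights={"original": 1.0, "normalised": 2.0})
print(ranking)
```

In this sketch the query shares almost no distinctive raw tokens with the corpus, yet the normalised trigrams still match the two addition functions, so they rank above the unrelated loop. Changing `weights` is the toy counterpart of promoting one clone type over another in the results.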
