Journal: Data Mining and Knowledge Discovery

Sourcerer: mining and searching internet-scale software repositories

Abstract

Large repositories of source code available over the Internet, or within large organizations, create new challenges and opportunities for data mining and statistical machine learning. Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, fingerprinting, and database storage of open source software at Internet scale. In one experiment, we gather 4,632 Java projects from SourceForge and Apache totaling over 38 million lines of code from 9,250 developers. Simple statistical analyses of the data first reveal robust power-law behavior for package, method call, and lexical containment distributions. We then develop and apply unsupervised, probabilistic, topic and author-topic (AT) models to automatically discover the topics embedded in the code and extract topic-word, document-topic, and AT distributions. In addition to serving as a convenient summary for program function and developer activities, these and other related distributions provide a statistical and information-theoretic basis for quantifying and analyzing source file similarity, developer similarity and competence, topic scattering, and document tangling, with direct applications to software engineering and software development staffing. Finally, by combining software textual content with structural information captured by our CodeRank approach, we are able to significantly improve software retrieval performance, increasing the area under the curve (AUC) retrieval metric to 0.92, roughly 10–30% better than previous approaches based on text alone. A prototype of the system is available at: http://sourcerer.ics.uci.edu.
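
The abstract does not spell out how CodeRank is computed or how it is combined with text scores. As a rough illustration only, the sketch below assumes a generic PageRank-style power iteration over a hypothetical "references" graph between code entities, followed by a simple linear blend with a text-match score. The function names `pagerank` and `rerank`, the damping factor, the blend weight, and the toy graph are all assumptions made for this example and are not taken from the paper.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Generic PageRank by power iteration (an assumed stand-in for a
    structure-based code ranking, not the paper's actual algorithm).

    graph maps each entity to the list of entities it references.
    """
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by entities with no outgoing references is spread evenly.
        dangling = sum(rank[v] for v in nodes if not graph.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for source, targets in graph.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


def rerank(text_scores, ranks, weight=0.5):
    """Order candidates by a linear blend of text score and normalized rank.

    The blend and its weight are hypothetical choices for illustration.
    """
    top = max(ranks.values())

    def blended(name):
        return weight * text_scores[name] + (1.0 - weight) * ranks[name] / top

    return sorted(text_scores, key=blended, reverse=True)


# Toy graph: each hypothetical class lists the classes it references.
toy_graph = {
    "Logger": [],
    "Parser": ["Logger"],
    "Indexer": ["Parser", "Logger"],
    "Main": ["Indexer", "Parser", "Logger"],
}
toy_ranks = pagerank(toy_graph)

# Two candidates tie on text score; the structural rank breaks the tie.
toy_text_scores = {"Logger": 0.2, "Parser": 0.9, "Main": 0.9}
toy_order = rerank(toy_text_scores, toy_ranks)
```

In this toy setup, entities that many others reference accumulate more rank, so a heavily used class can outrank an equally well-matching but rarely referenced one. How the actual system weighs structure against text is not stated in the abstract.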