Venue: International Conference on Multimedia Big Data

Statistical Unigram Analysis for Source Code Repository


Abstract

The unigram is the basic building block of n-gram models in natural language processing. However, unigrams collected from a natural language corpus are ill-suited to problems in the domain of programming languages. In this paper, we analyze the properties of unigrams collected from an ultra-large source code repository. Specifically, we collected 1.01 billion unigrams from 0.7 million open source projects hosted on GitHub.com. By analyzing these unigrams, we discovered statistical patterns regarding (1) how developers name variables, methods, and classes, and (2) how developers choose abbreviations. Our study describes a probabilistic model for solving a well-known problem in source code analysis: how to expand a given abbreviation to its original intended word. It shows that unigrams collected from source code repositories are essential resources for solving domain-specific problems.
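
To make the abbreviation-expansion problem concrete, here is a minimal Python sketch. It is not the paper's probabilistic model, which the abstract does not specify. It assumes that unigrams are obtained by splitting identifiers on camelCase and underscore boundaries, and that a candidate expansion is any longer word that shares the abbreviation's first letter and contains its letters in order. Candidates are then ranked by relative unigram frequency. All function names and the sample identifiers are hypothetical.

```python
import re
from collections import Counter


def split_identifier(identifier):
    """Split an identifier such as 'maxBufferSize' or 'max_buffer_size'
    into lowercase alphabetic unigrams (digits and underscores are dropped)."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", identifier)
    return [part.lower() for part in parts]


def count_unigrams(identifiers):
    """Count how often each unigram occurs across a collection of identifiers."""
    counts = Counter()
    for identifier in identifiers:
        counts.update(split_identifier(identifier))
    return counts


def is_subsequence(abbr, word):
    """Return True if the letters of abbr appear in word in the same order."""
    remaining = iter(word)
    return all(ch in remaining for ch in abbr)


def expand_abbreviation(abbr, counts, top_k=3):
    """Rank candidate expansions of abbr by relative unigram frequency.

    Assumption (not from the paper): a candidate is a longer word with the
    same first letter that contains abbr as a subsequence.
    """
    abbr = abbr.lower()
    if not abbr:
        return []
    candidates = {
        word: count
        for word, count in counts.items()
        if len(word) > len(abbr)
        and word[0] == abbr[0]
        and is_subsequence(abbr, word)
    }
    total = sum(candidates.values())
    ranked = sorted(candidates.items(), key=lambda item: (-item[1], item[0]))
    return [(word, count / total) for word, count in ranked[:top_k]]


# Hypothetical identifiers standing in for a real source code corpus.
sample_identifiers = ["maxBufferSize", "buffer_length", "readBuffer", "buildPath", "buf"]
sample_counts = count_unigrams(sample_identifiers)
print(expand_abbreviation("buf", sample_counts))
```

Normalizing over the candidate set only gives a crude estimate of how likely each expansion is given the abbreviation. A model built on a real corpus would presumably also account for how likely developers are to shorten each word in that particular way.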
