Statistical Unigram Analysis for Source Code Repository

Weifeng Xu; Dianxiang Xu; Abdulrahman Alatawi; Omar El Ariss; Yunkai Liu

Abstract

The unigram is the basic building block of n-gram models in natural language processing. However, unigrams collected from a natural language corpus are poorly suited to problems in the domain of programming languages. In this paper, we analyze the properties of unigrams collected from an ultra-large source code repository. Specifically, we collected 1.01 billion unigrams from 0.7 million open source projects hosted on GitHub.com. By analyzing these unigrams, we discovered statistical properties regarding (1) how developers name variables, methods, and classes, and (2) how developers choose abbreviations. We describe a probabilistic model that relies on these properties to solve a well-known problem in source code analysis: how to expand a given abbreviation to its original intended word. Our empirical study shows that unigrams extracted from the source code repository outperform those from a natural language corpus by 21% when solving domain-specific problems.
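
The abstract does not spell out the probabilistic model, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that unigrams are obtained by splitting identifiers on camelCase and snake_case boundaries, and that a candidate expansion is any longer unigram that starts with the same letter and contains the abbreviation's letters in order. Candidates are ranked by relative frequency. The names `split_identifier`, `build_counts`, `is_abbreviation_of`, and `expand` are hypothetical and chosen for illustration.

```python
import re
from collections import Counter


def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase unigrams."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [p.lower() for p in parts]


def build_counts(identifiers):
    """Count how often each unigram occurs across a list of identifiers."""
    counts = Counter()
    for identifier in identifiers:
        counts.update(split_identifier(identifier))
    return counts


def is_abbreviation_of(abbr, word):
    """Assumed matching rule: same first letter, and the letters of abbr
    appear in word in the same order (abbr must be shorter than word)."""
    if not abbr or len(abbr) >= len(word) or abbr[0] != word[0]:
        return False
    remaining = iter(word)
    # `ch in remaining` consumes the iterator, which enforces letter order.
    return all(ch in remaining for ch in abbr)


def expand(abbr, counts, top=3):
    """Rank candidate expansions by their relative unigram frequency."""
    candidates = {w: c for w, c in counts.items() if is_abbreviation_of(abbr, w)}
    total = sum(candidates.values())
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, count / total) for word, count in ranked[:top]]


if __name__ == "__main__":
    # Toy identifiers standing in for a source code corpus (made-up data).
    sample = ["getButton", "buttonText", "btnSave", "button_width",
              "batonPass", "maxLength", "lengthOf"]
    print(expand("btn", build_counts(sample)))
```

On this toy input, "button" ranks above "baton" for the abbreviation "btn" simply because it occurs more often among the identifiers, which mirrors the abstract's claim that frequencies drawn from source code are more informative for this task than frequencies drawn from natural language text.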


Bibliographic record

Source: International Journal of Semantic Computing, 2018, Issue 2, 24 pages
Author affiliations

1. Department of Computer Science, Bowie State University, Bowie, Maryland, USA
2. Department of Computer Science, Boise State University, Boise, Idaho, USA
3. Department of Computer Science, Texas A&M University, Commerce, TX 75428, USA
4. Department of Computer & Information Science, Gannon University, Erie, Pennsylvania, USA

Indexing information
Original format: PDF
Language: English
CLC classification: Computing technology, computer technology
Keywords: programming language; source code; n-gram; unigram; abbreviations; ultra-large-scale analysis
