Feature-based analysis of open source using big data analytics.



Abstract

The open source code base has grown enormously, and understanding the functionality of its projects has become extremely difficult as a result. Existing feature-discovery approaches that aim to identify functionality are typically semi-automatic and often require human intervention. In this thesis, an innovative framework is proposed that uses machine learning to automatically and dynamically discover the features of any open source project, together with the components that implement them. The overall goal of the approach is to create an automated and scalable model that produces accurate results.

The initial step is to extract the metadata and perform pre-processing. The next step is to dynamically discover topics using Latent Dirichlet Allocation (LDA) and to form components optimally using K-Means. The final step is to discover the features implemented in the components using the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm. The framework is implemented in Spark, a fast, parallel processing engine for big data analytics. The ArchStudio tool is used to visualize the feature-to-class mapping. As case studies, Apache Solr and Apache Hadoop HDFS are used to illustrate the automatic discovery of components and features. We demonstrate the scalability and accuracy of the proposed model, using a manual evaluation by software architecture experts as the baseline. For Apache Solr, the accuracy is 85% when compared with the manual evaluation. In addition, the automated framework discovered many new features in both case studies.
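The final step of the pipeline described above, ranking terms by TF-IDF to characterize each component, can be illustrated with a minimal sketch. This is not the thesis's Spark implementation: it is plain Python on toy data, and the component names, the tokens, and the helper `top_terms_by_tfidf` are all hypothetical. It also assumes the components have already been formed (in the thesis, by LDA followed by K-Means).

```python
import math
from collections import Counter


def top_terms_by_tfidf(components, k=3):
    """Return the k highest-scoring TF-IDF terms for each component.

    components: dict mapping a component name to a list of tokens
    (for example, words split out of class and method names).
    """
    n_docs = len(components)

    # Document frequency: the number of components in which each term occurs.
    doc_freq = Counter()
    for tokens in components.values():
        doc_freq.update(set(tokens))

    result = {}
    for name, tokens in components.items():
        counts = Counter(tokens)
        total = len(tokens)
        # TF is the relative frequency within the component;
        # IDF is log(N / df), so terms shared by all components score 0.
        scores = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, then alphabetically to break ties stably.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[name] = ranked[:k]
    return result


# Hypothetical toy input, standing in for tokens extracted from source code.
components = {
    "search": ["query", "index", "parser", "query", "score", "index"],
    "storage": ["block", "replica", "index", "block", "node"],
    "admin": ["config", "node", "metrics", "config"],
}
features = top_terms_by_tfidf(components, k=2)
print(features)
```

Terms that appear in several components, such as `index` and `node` here, are down-weighted by the IDF factor, so the top-ranked terms tend to be the ones that distinguish a component from the rest.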

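Bibliographic record

  • Author

    Krishnan, Malathy

  • Author affiliation

    University of Missouri - Kansas City

  • Degree-granting institution: University of Missouri - Kansas City
  • Subject: Computer science
  • Degree: M.S.
  • Year: 2015
  • Pages: 88 p.
  • Total pages: 88
  • Original format: PDF
  • Language of text: eng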

