IEEE/ACM International Conference on Mining Software Repositories

Natural Language or Not (NLoN) - A Package for Software Engineering Text Analysis Pipeline

Abstract

The use of natural language processing (NLP) is gaining popularity in software engineering. To perform NLP correctly, we must pre-process textual information to separate natural language from other content, such as log messages, that is often part of communication in software engineering. We present a simple approach for classifying whether a given textual input is natural language or not. Although our NLoN package relies on only 11 language features and character tri-grams, it achieves an area under the ROC curve between 0.976 and 0.987 on three different data sources, with Lasso regression from Glmnet as the learner and two human raters providing the ground truth. Cross-source prediction performance is lower and fluctuates more, with the top areas under the ROC curve ranging from 0.913 to 0.980. Compared with prior work, our approach offers similar performance but is considerably more lightweight, making it easier to apply in software engineering text mining pipelines. Our source code and data are provided as an R package for further improvement.
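
As a rough illustration of the kind of input representation the abstract describes, the sketch below counts overlapping character tri-grams and computes a few hand-crafted line-level features. It is written in Python rather than R for brevity, and the function names and the four features shown are assumptions made for this example; they are not the 11 features used by NLoN, and no classifier is trained here.

```python
from collections import Counter


def char_trigrams(text: str) -> Counter:
    """Count overlapping character tri-grams in one line of text."""
    # Strings shorter than three characters yield an empty Counter.
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def simple_features(text: str) -> dict:
    """Hypothetical hand-crafted features; NOT the feature set used by NLoN."""
    n = max(len(text), 1)  # avoid division by zero on empty input
    return {
        "ratio_upper": sum(c.isupper() for c in text) / n,
        "ratio_digit": sum(c.isdigit() for c in text) / n,
        "ratio_special": sum(
            (not c.isalnum()) and (not c.isspace()) for c in text
        ) / n,
        "ends_with_period": float(text.rstrip().endswith(".")),
    }


# Two made-up inputs: a sentence of prose and a log-like line.
prose = "This patch fixes the crash on startup."
log = "2018-05-28 12:00:01 ERROR [main] NullPointerException at Foo.java:42"

print(char_trigrams("abcabc"))
print(simple_features(prose))
print(simple_features(log))
```

In a pipeline like the one described, vectors of this kind would be passed to an L1-regularized (Lasso) learner; the example stops at feature extraction so that it stays dependency-free.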
