Science of Computer Programming

Mining structured data in natural language artifacts with island parsing



Abstract

Software repositories typically store data composed of structured and unstructured parts. Researchers mine this data to empirically validate research ideas and to support practitioners' activities. Structured data (e.g., source code) has a formal syntax and is straightforward to analyze; unstructured data (e.g., documentation) is a mix of natural language, noise, and snippets of structured data, and is harder to analyze. The structured content (e.g., code snippets) embedded in unstructured data is especially valuable. Researchers have proposed several approaches to recognize, extract, and analyze structured data embedded in natural language. We analyze these approaches and investigate their drawbacks. We then present two novel methods, based on scannerless generalized LR (SGLR) parsing and Parsing Expression Grammars (PEGs), that address these drawbacks and mine structured fragments within unstructured data. We validate and compare these methods on development emails and Stack Overflow posts containing Java code fragments. Both achieve high precision and recall, but the PEG-based one offers better computational performance and is simpler to engineer.
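
To make the idea of island parsing concrete, here is a minimal sketch in Python. It is not the authors' SGLR or PEG implementation, and it does not use a Java grammar. It assumes a single hypothetical "island" (a Java-like method call such as `list.add(x)`) and treats everything else as "water" to be skipped. The function names (`parse_island`, `extract_islands`, and the helpers) are invented for this illustration; the recognizer follows the PEG-style ordered choice `Document <- (Island / Water)*`.

```python
import re

# Assumption for this sketch: an identifier looks like a Java identifier.
IDENT = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")


def parse_qualified_name(text, pos):
    """QualifiedName <- Ident ('.' Ident)* ; returns the end position or None."""
    m = IDENT.match(text, pos)
    if not m:
        return None
    pos = m.end()
    while pos < len(text) and text[pos] == ".":
        m = IDENT.match(text, pos + 1)
        if not m:
            break  # a trailing '.' (e.g. end of a sentence) is left as water
        pos = m.end()
    return pos


def parse_balanced_parens(text, pos):
    """Args <- '(' (Args / !')' .)* ')' ; returns the end position or None."""
    if pos >= len(text) or text[pos] != "(":
        return None
    pos += 1
    while pos < len(text):
        if text[pos] == ")":
            return pos + 1
        if text[pos] == "(":
            nested = parse_balanced_parens(text, pos)
            if nested is None:
                return None  # unbalanced nesting: this is not an island
            pos = nested
        else:
            pos += 1
    return None  # reached end of input without a closing ')'


def parse_island(text, pos):
    """Island <- QualifiedName Args ; returns the end position or None."""
    end = parse_qualified_name(text, pos)
    if end is None:
        return None
    return parse_balanced_parens(text, end)


def extract_islands(text):
    """Document <- (Island / Water)* ; returns the matched island strings."""
    islands = []
    pos = 0
    while pos < len(text):
        end = parse_island(text, pos)
        if end is not None:
            islands.append(text[pos:end])
            pos = end
        else:
            # Water: skip a whole word so an island never starts mid-identifier,
            # otherwise skip a single character.
            m = IDENT.match(text, pos)
            pos = m.end() if m else pos + 1
    return islands


if __name__ == "__main__":
    post = "I call list.add(map.get(key)) in my loop. Any idea (or hint)?"
    print(extract_islands(post))  # ['list.add(map.get(key))']
```

A toy island this small will also match prose such as "call(s)", which is one reason a real miner needs a much richer island grammar than this sketch provides.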

