Conference: International Conference on Machine Learning for Cyber Security

Software Entity Recognition Method Based on BERT Embedding



Abstract

The global open source software ecosystem contains rich information about software engineering. Existing methods for analyzing the text content of knowledge communities in this field focus mainly on structural relationships and on rule-based association and mining. This paper proposes a software entity recognition method based on BERT word embeddings. First, a BiLSTM-CRF model is constructed and combined with word vector embeddings from the software engineering domain to form the entity recognition model. Then, the word vectors in the model's input layer are improved by introducing the BERT pre-trained language model. The pre-training data are constructed from discussion content in the Stack Overflow software Q&A community, and these data are used to pre-train the BERT model so as to obtain word vector representations suited to the software engineering domain. This improves entity recognition in the software engineering domain and addresses the problem that traditional word vector embeddings are mostly trained on general-domain data, which is not fully suitable for software engineering and cannot represent contextual semantic information well. In addition, because annotated data in the software domain are scarce, this paper attempts to extend the data through model prediction and dictionary matching, and tests this experimentally. Finally, this paper uses deep learning to realize entity recognition in the software engineering domain, providing support for the extraction of software entities, the construction of software knowledge bases, and intelligent applications of software engineering.
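The abstract mentions extending scarce annotated data by dictionary matching but does not say how the matching is done. The following is only a minimal sketch of one common approach, longest-match lookup against an entity dictionary that emits BIO tags. The dictionary entries, the entity type names (`LANG`, `LIB`, `SITE`), and the function name `dictionary_bio_tags` are hypothetical illustrations, not taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical entity dictionary (assumption, not from the paper):
# lowercase token sequences mapped to an illustrative entity type.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("python",): "LANG",
    ("numpy",): "LIB",
    ("stack", "overflow"): "SITE",
}


def dictionary_bio_tags(
    tokens: List[str], gazetteer: Dict[Tuple[str, ...], str]
) -> List[str]:
    """Tag tokens with BIO labels using greedy longest-match dictionary lookup."""
    max_len = max((len(key) for key in gazetteer), default=0)
    lowered = [tok.lower() for tok in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so multi-token entries win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = gazetteer.get(tuple(lowered[i:i + n]))
            if label is not None:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


if __name__ == "__main__":
    sentence = "How do I use NumPy on Stack Overflow examples in Python".split()
    for token, tag in zip(sentence, dictionary_bio_tags(sentence, GAZETTEER)):
        print(f"{token}\t{tag}")
```

Sentences labeled this way could be added to the training set as weakly annotated examples. Plain dictionary matching ignores context and can mislabel ambiguous words, which may be one reason the paper pairs it with model prediction.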
