Conference: International Seminar on Intelligent Technology and Its Applications

Semi-supervised learning approach for Indonesian Named Entity Recognition (NER) using co-training algorithm



Abstract

The main obstacle to applying machine learning to Indonesian Named Entity Recognition (NER) is the limited amount of labelled training data. Unlabelled data, by contrast, is widely available from many sources, which makes a semi-supervised learning approach a natural fit for the problem. This research aims to design a semi-supervised learning model for Indonesian NER. Semi-supervised co-training is used to exploit unlabelled data during NER learning and to produce new labelled data that can be used to improve a new NER classifier. Two kinds of data are used: Indonesian DBpedia data as labelled data, and news article text from Indonesian news sites (kompas.com, cnnindonesia.com, tempo.co, merdeka.com and viva.co.id) as unlabelled data. The pre-processing steps applied to the unstructured text are sentence segmentation, tokenization, stemming, and PoS tagging. The output of pre-processing is the named entities and their contexts, which serve as unlabelled data for the semi-supervised co-training process. An SVM is used as the classification algorithm in this process, and 10-fold cross-validation is used to test the system. In testing, the NER system achieves a precision of 73.6%, a recall of 80.1%, and a mean F1 of 76.5%.
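To make the co-training idea in the abstract concrete, here is a minimal sketch of such a loop. It is not the authors' implementation. It assumes two views per candidate (the entity's own tokens and its context words), which the abstract only hints at; it swaps the SVM for a simple feature-count scorer so that it runs with the standard library alone; and every function name, threshold, and data item is hypothetical.

```python
from collections import Counter, defaultdict

def train(examples):
    """Count feature/label co-occurrences (a stand-in for the SVM, by assumption)."""
    counts = defaultdict(Counter)
    for features, label in examples:
        for feature in features:
            counts[label][feature] += 1
    return counts

def predict(model, features):
    """Return (label, confidence); confidence is the winning label's share of votes."""
    scores = {label: sum(c[f] for f in features) for label, c in model.items()}
    total = sum(scores.values())
    if total == 0:
        return None, 0.0  # no known feature, so no prediction
    best = max(scores, key=scores.get)
    return best, scores[best] / total

def co_train(labelled, unlabelled, rounds=5, threshold=0.8):
    """Simplified co-training: one classifier per view, confident labels are shared."""
    labelled, pool = list(labelled), list(unlabelled)
    for _ in range(rounds):
        model_a = train([(a, y) for a, _, y in labelled])  # view A: entity tokens
        model_b = train([(b, y) for _, b, y in labelled])  # view B: context words
        remaining, added = [], 0
        for a, b in pool:
            label_a, conf_a = predict(model_a, a)
            label_b, conf_b = predict(model_b, b)
            # Keep the label from whichever view is more confident.
            label, conf = (label_a, conf_a) if conf_a >= conf_b else (label_b, conf_b)
            if label is not None and conf >= threshold:
                labelled.append((a, b, label))
                added += 1
            else:
                remaining.append((a, b))
        pool = remaining
        if added == 0:  # nothing new was learned, so stop early
            break
    return labelled, pool

# Toy data invented for illustration; it does not come from the paper's corpora.
labelled = [
    ({"joko", "widodo"}, {"presiden", "mengatakan"}, "PERSON"),
    ({"jakarta"}, {"di", "kota"}, "LOCATION"),
]
unlabelled = [
    ({"susilo"}, {"presiden", "mengatakan"}),  # context resembles PERSON
    ({"bandung"}, {"di", "kota"}),             # context resembles LOCATION
    ({"susilo"}, {"kemarin"}),                 # only view A can help, after round 1
    ({"xyz"}, {"abc"}),                        # no overlap with any known feature
]

new_labelled, leftover = co_train(labelled, unlabelled)
print(len(new_labelled), "labelled examples;", len(leftover), "left unlabelled")
```

In this toy run, the context view labels "susilo" in the first round, and the entity view then recognizes the same token in an unfamiliar context in the second round. That exchange between views is the mechanism co-training relies on to grow the labelled set.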
