Conference: International Seminar on Intelligent Technology and Its Applications

Semi-supervised learning approach for Indonesian Named Entity Recognition (NER) using co-training algorithm

Abstract

The main obstacle to applying machine learning to Indonesian Named Entity Recognition (NER) is the limited amount of labelled training data. Unlabelled data, by contrast, is widely available from many sources, which makes a semi-supervised learning approach a natural fit for the problem. This research aims to design a semi-supervised learning model for Indonesian NER. Semi-supervised co-training is used to exploit unlabelled data during NER learning and to produce new labelled data that can be used to improve a new NER classification system. The research uses two kinds of data: Indonesian DBPedia data as labelled data, and news article text from Indonesian news sites (kompas.com, cnnindonesia.com, tempo.co, merdeka.com and viva.co.id) as unlabelled data. The pre-processing steps applied to the unstructured text are sentence segmentation, tokenization, stemming, and PoS tagging. The output of pre-processing is the named entities and their context, which serve as unlabelled data for the semi-supervised co-training process. The SVM algorithm is used as the classification algorithm in this process, and 10-fold cross-validation is used as the system testing approach. In the NER system tests, precision is 73.6%, recall is 80.1%, and the mean F1 is 76.5%.
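The co-training loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and several things here are assumptions: the two views are taken to be the entity's own tokens and its context words, a toy feature-voting classifier (the hypothetical `VoteClassifier`) stands in for the SVM so that the example needs only the standard library, and the confidence threshold, round limit, labels, and sample tokens are invented for illustration.

```python
from collections import Counter, defaultdict


class VoteClassifier:
    """Toy stand-in for the SVM (assumption): each feature votes for the
    labels it co-occurred with in the labelled data."""

    def fit(self, feature_sets, labels):
        self.votes = defaultdict(Counter)
        for feats, label in zip(feature_sets, labels):
            for f in feats:
                self.votes[f][label] += 1
        return self

    def predict(self, feats):
        """Return (label, confidence); (None, 0.0) if no feature is known."""
        tally = Counter()
        for f in feats:
            tally.update(self.votes.get(f, {}))
        if not tally:
            return None, 0.0
        label, count = tally.most_common(1)[0]
        return label, count / sum(tally.values())


def co_train(labelled, unlabelled, rounds=5, threshold=0.8):
    """labelled: list of (view_a, view_b, label); unlabelled: list of (view_a, view_b).

    Each round trains one classifier per view on the shared labelled set,
    then moves confidently labelled examples from the pool into that set.
    """
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(rounds):
        ys = [y for _, _, y in labelled]
        clf_a = VoteClassifier().fit([a for a, _, _ in labelled], ys)
        clf_b = VoteClassifier().fit([b for _, b, _ in labelled], ys)
        remaining, added = [], 0
        for a, b in pool:
            label_a, conf_a = clf_a.predict(a)
            label_b, conf_b = clf_b.predict(b)
            # Use whichever view is more confident about this example.
            label, conf = (label_a, conf_a) if conf_a >= conf_b else (label_b, conf_b)
            if label is not None and conf >= threshold:
                labelled.append((a, b, label))
                added += 1
            else:
                remaining.append((a, b))
        pool = remaining
        if added == 0:  # nothing new was learned, so stop early
            break
    return labelled, pool


# Hypothetical seed data: (entity tokens, context words, label).
seed = [
    ({"joko", "widodo"}, {"presiden"}, "PERSON"),
    ({"jakarta"}, {"di", "kota"}, "LOCATION"),
]
# Hypothetical unlabelled candidates: (entity tokens, context words).
unlabelled = [
    ({"susilo"}, {"presiden"}),
    ({"susilo", "yudhoyono"}, {"kata"}),
    ({"bandung"}, {"harga"}),
]

new_labelled, leftover = co_train(seed, unlabelled)
```

In this toy run the context view labels the first candidate in round one, which lets the entity view label the second candidate in round two; the third shares no features with anything labelled and stays in the pool. A fuller version would replace `VoteClassifier` with an SVM over vectorised features and evaluate the resulting classifier with 10-fold cross-validation.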