International Conference on Asian Language Processing

Building the Indonesian NE Dataset Using Wikipedia and DBpedia with Entities Expansion Method on DBpedia

Abstract

Named Entity Recognition (NER) for Indonesian still needs considerable improvement, even though NER is a main component of Information Extraction (IE), on which other, more advanced components depend. Building a reliable Indonesian NER system with a machine learning approach requires a large dataset, but a dataset that is tagged manually tends to be very small. A system was therefore created to build an Indonesian Named Entity (NE) dataset that is tagged automatically, using Wikipedia as the corpus source and DBpedia as the NE labeling reference, with the Entities Expansion method used to expand that DBpedia reference. The existing system has three shortcomings: during automatic tagging it cannot detect names that contain words beginning with a lowercase letter, it has not been tried with person entity gazetteers added, and the rules of the DBpedia Entities Expansion method can still be modified to produce a better-quality NE labeling reference. This study built a system to address these shortcomings. In the evaluation, the best Indonesian NE dataset built in this study produced an F1-score of 54.93%, which is 3.32 percentage points higher than the 51.61% reported in previous work. This best dataset was built by adding the detection method to automatic tagging and using the modified DBpedia Entities Expansion rules from this study, but without adding person entity gazetteers.
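
The abstract does not give the tagging algorithm, so the following is only a minimal sketch of what reference-based automatic tagging can look like, not the authors' method. It assumes a tiny in-memory dictionary in place of the DBpedia labeling reference, whitespace tokenisation, BIO output tags, and the label names PERSON, PLACE and ORGANISATION; the entries, the function name `tag_sentence` and the `max_len` parameter are all hypothetical. Matching whole token sequences against the reference, instead of looking only at capitalised words, is one simple way to keep a lowercase word such as "bin" inside a name.

```python
from typing import Dict, List

# Hypothetical labeling reference (entity name -> NE class).
# In the described system this role is played by DBpedia; these
# entries are invented purely for illustration.
REFERENCE: Dict[str, str] = {
    "Ahmad bin Hasan": "PERSON",
    "Bank Indonesia": "ORGANISATION",
    "Jakarta": "PLACE",
}


def tag_sentence(tokens: List[str], reference: Dict[str, str],
                 max_len: int = 6) -> List[str]:
    """Assign BIO tags by greedy longest match against the reference."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            label = reference.get(candidate)
            if label is not None:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1  # no entity starts at this token
    return tags


if __name__ == "__main__":
    sentence = "Ahmad bin Hasan bekerja di Bank Indonesia di Jakarta"
    toks = sentence.split()
    for token, tag in zip(toks, tag_sentence(toks, REFERENCE)):
        print(f"{token}\t{tag}")
```

In this sketch the lowercase token "bin" receives an I-PERSON tag because the whole span matches a reference entry. A real pipeline would also need to handle ambiguity, name variants and punctuation, none of which is covered here.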