Conference: International Conference on Language Resources and Evaluation

SiNER: A Large Dataset for Sindhi Named Entity Recognition

Abstract

We introduce SiNER, a named entity recognition (NER) dataset for the low-resource Sindhi language, together with quality baselines. It contains 1,338 news articles and more than 1.35 million tokens collected from the Kawish and Awami Awaz Sindhi newspapers and annotated using the begin-inside-outside (BIO) tagging scheme. The proposed dataset is likely to be a significant resource for statistical Sindhi language processing. The ultimate goal of developing SiNER is to present a gold-standard dataset for Sindhi NER along with quality baselines. We implement several baseline approaches based on conditional random fields (CRF) and on recent state-of-the-art bi-directional long short-term memory (Bi-LSTM) models. The promising F1-score of 89.16% achieved by the Bi-LSTM-CRF model with character-level representations demonstrates the quality of our proposed SiNER dataset.
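To make the BIO scheme and the F1 metric concrete, the following is a minimal sketch of how BIO tags can be decoded into entity spans and scored. It is an illustration only, not the authors' evaluation code. The tag names `PER` and `LOC`, the helper names `bio_to_spans` and `span_f1`, and the choice of exact-match, entity-level scoring are all assumptions; the abstract does not state the tag set or whether the reported F1 is entity-level or token-level.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Collect (start, end, type) spans from one BIO tag sequence.

    An I- tag that does not continue a span of the same type is ignored.
    """
    spans: List[Span] = []
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Exact-match, entity-level F1 over parallel lists of tag sequences."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        gs, ps = set(bio_to_spans(g)), set(bio_to_spans(p))
        tp += len(gs & ps)
        n_gold += len(gs)
        n_pred += len(ps)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Hypothetical four-token sentence: the gold annotation has two entities,
# the prediction recovers only the first one.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
score = span_f1(gold, pred)
```

In this toy case precision is 1.0 and recall is 0.5, so `score` is 2/3. Under exact-match scoring a predicted entity counts only if both its boundaries and its type agree with the gold span.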