SiNER: A Large Dataset for Sindhi Named Entity Recognition


Abstract

We introduce SiNER, a named entity recognition (NER) dataset for the low-resource Sindhi language, together with quality baselines. It contains 1,338 news articles and more than 1.35 million tokens collected from the Kawish and Awami Awaz Sindhi newspapers and annotated using the begin-inside-outside (BIO) tagging scheme. The proposed dataset is likely to be a significant resource for statistical Sindhi language processing. The ultimate goal of developing SiNER is to present a gold-standard dataset for Sindhi NER along with quality baselines. We implement several baseline approaches based on conditional random fields (CRF) and recent state-of-the-art bi-directional long short-term memory (Bi-LSTM) models. The promising F1-score of 89.16% achieved by the Bi-LSTM-CRF model with character-level representations demonstrates the quality of the proposed SiNER dataset.
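As an illustration of the BIO scheme and the F1 metric mentioned in the abstract, the following is a minimal sketch, not the authors' code. The label names (PERSON, LOC), the placeholder English tokens, and the helper functions `spans_to_bio`, `bio_to_spans`, and `entity_f1` are all assumptions made for this example; the abstract does not state the SiNER tag set or the exact scoring procedure. The sketch uses exact-match, entity-level F1, which is a common convention for NER evaluation.

```python
from typing import List, Tuple

# (start, end_exclusive, label); hypothetical span format for this sketch
Span = Tuple[int, int, str]


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Encode entity spans as BIO tags: B- opens an entity, I- continues it, O is outside."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Decode BIO tags back into spans. A stray I- tag is treated as opening a new entity."""
    spans: List[Span] = []
    start, label = None, None
    # The trailing "O" is a sentinel that flushes an entity ending at the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        if label is not None and tag == f"I-{label}":
            continue  # still inside the current entity
        if label is not None:
            spans.append((start, i, label))
            start, label = None, None
        if tag != "O":
            start, label = i, tag[2:]
    return spans


def entity_f1(gold_tags: List[str], pred_tags: List[str]) -> float:
    """Entity-level F1: a prediction counts only if boundaries and label both match."""
    gold = set(bio_to_spans(gold_tags))
    pred = set(bio_to_spans(pred_tags))
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Placeholder sentence (not taken from SiNER): "Ali Khan visited Karachi"
gold = spans_to_bio(4, [(0, 2, "PERSON"), (3, 4, "LOC")])
pred = ["B-PERSON", "O", "O", "B-LOC"]  # the person entity is cut short
score = entity_f1(gold, pred)
```

Here `gold` is `["B-PERSON", "I-PERSON", "O", "B-LOC"]`. The prediction recovers the location but truncates the person name, so one of two predicted entities matches one of two gold entities, giving precision, recall, and F1 of 0.5 each.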


Bibliographic details

Source: International Conference on Language Resources and Evaluation, 2020, pp. 2953-2961 (9 pages)
Authors: Wazir Ali; Junyu Lu; Zenglin Xu
Original format: PDF
Keywords: Language Resources; SiNER; Sindhi Language; Named Entity Recognition


