Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi

Abstract

The recent advances in Natural Language Processing have been a boon only for well-represented languages, neglecting research in lesser-known global languages. This is partly due to the availability of curated data and research resources for those languages. One of the current challenges concerning low-resourced languages is the lack of clear guidelines on the collection, curation and preparation of datasets for different use cases. In this work, we take on the task of creating two datasets focused on news headlines (i.e., short text) for Setswana and Sepedi, and of creating a news topic classification task from these datasets. In this study, we document our work, propose baselines for classification, and investigate a data augmentation approach better suited to low-resourced languages in order to improve the performance of the classifiers.

Bibliographic details

Source
Workshop on Resources for African Indigenous Languages; Language Resources and Evaluation Conference | 2020 | pp. 15-20 | 6 pages
Authors
Vukosi Marivate; Tshephisho Sefara; Vongani Chabalala; Keamogetswe Makhaya; Tumisho Mokgonyane; Rethabile Mokoena; Abiodun Modupe
Original format: PDF
