Venue: Workshop on Resources for African Indigenous Languages; Language Resources and Evaluation Conference

Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi

Abstract

Recent advances in Natural Language Processing have mostly been a boon for well-represented languages, neglecting research in lesser-known global languages. This is in part due to the limited availability of curated data and research resources for such languages. One of the current challenges concerning low-resourced languages is the lack of clear guidelines on the collection, curation and preparation of datasets for different use cases. In this work, we take on the task of creating two datasets focused on news headlines (i.e., short text) for Setswana and Sepedi, and of creating a news topic classification task from these datasets. We document our work, propose baselines for classification, and investigate a data augmentation approach better suited to low-resourced languages in order to improve the performance of the classifiers.
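
The abstract does not say which classifiers or which augmentation method the authors used, so the following is only a minimal sketch of the general idea: augment a small set of labelled headlines by word substitution, then train a topic classifier on the enlarged set. Everything in it is an assumption made for illustration: the toy headlines are English placeholders (not taken from the Setswana or Sepedi datasets), the `NEIGHBOURS` table is a hypothetical stand-in for whatever source of similar words an augmentation method would use, and multinomial Naive Bayes is simply a common short-text baseline rather than the paper's confirmed choice.

```python
import math
import random
from collections import Counter, defaultdict

# Hypothetical toy headlines (English placeholders, NOT from the paper's datasets).
TRAIN = [
    ("minister announces new budget", "politics"),
    ("parliament debates budget vote", "politics"),
    ("team wins cup final", "sports"),
    ("striker scores in final match", "sports"),
]

# Hypothetical substitution table standing in for a source of similar words.
NEIGHBOURS = {"wins": ["takes"], "budget": ["spending"], "match": ["game"]}


def augment(text, rng, neighbours=NEIGHBOURS):
    """Return a copy of text with one replaceable word swapped for a neighbour."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w in neighbours]
    if not candidates:
        return text  # nothing to substitute, keep the headline unchanged
    i = rng.choice(candidates)
    words[i] = rng.choice(neighbours[words[i]])
    return " ".join(words)


def train_nb(examples):
    """Count classes and words for a multinomial Naive Bayes model."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for w in text.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest add-one-smoothed log-probability."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.split():
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


rng = random.Random(0)
# Each original headline contributes one augmented copy with the same label.
augmented = TRAIN + [(augment(t, rng), y) for t, y in TRAIN]
model = train_nb(augmented)
print(predict(model, "cup final game"))
```

In this toy setup the word "game" only enters the vocabulary through augmentation, which illustrates why substitution-based augmentation can help when labelled short texts are scarce. A real experiment would need held-out evaluation data and a substitution source appropriate to the target language.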
