Collaborative Speech Data Acquisition for Under Resourced Languages through Crowdsourcing

Sunita Arora; Karunesh Kumar Arora; Mukund Kumar Roy; S.S. Agrawal; B.K. Murthy

Abstract

Scarcity of resources in under-resourced languages may leave these languages behind in the race to develop data-driven NLP systems. Crowdsourcing has emerged as a technique to bridge this gap, as it offers an approach for collecting such resources collaboratively. Although some Indian languages are widely spoken throughout the world, many of them are resource-poor when measured by the availability of transcribed and annotated resources for building reliable data-driven systems. This paper describes an experience of collecting Hindi speech data through mobile devices using this approach, for building automatic speech recognition and other speech-based retrieval systems. The approach covers a wide variety of microphones and surrounding environments. Besides cost savings and speedy data collection, it offers the advantage that the framework can be adapted, in a language-independent manner, to collect different types of resources for various applications such as word sense disambiguation, named entity recognition, and sentiment analysis. Experiences, analysis, and challenges faced in recording more than 100 speakers are reported.

Bibliographic details

Source: Procedia Computer Science | 2016, Issue 1 | 8 pages
Authors: Sunita Arora; Karunesh Kumar Arora; Mukund Kumar Roy; S.S. Agrawal; B.K. Murthy
Original format: PDF
Chinese Library Classification: Computing technology, computer technology
