Journal: Procedia Computer Science

Collaborative Speech Data Acquisition for Under Resourced Languages through Crowdsourcing

Abstract

Scarcity of resources may leave under-resourced languages behind in the race to develop data-driven NLP systems. Crowdsourcing has emerged as a technique to bridge this gap, as it offers an approach for collecting such resources collaboratively. Although some Indian languages are widely spoken throughout the world, many of them are resource-poor when measured by the availability of transcribed and annotated resources for building reliable data-driven systems. This paper describes an experience of collecting Hindi speech data through mobile devices using this approach, for building automatic speech recognition and other speech-based retrieval systems. The approach covers a wide variety of microphones and surrounding environments. Besides cost savings and speedy data collection, it offers the advantage that the framework can be adapted, in a language-independent manner, to collect different types of resources for applications such as word sense disambiguation, named entity recognition, and sentiment analysis. Experiences, analysis, and challenges faced in recording more than 100 speakers are reported.
