
The LDC-IL Speech Corpora


Abstract

This paper introduces the first set of speech corpora released in 2019 by the Linguistic Data Consortium for Indian Languages (LDC-IL), a scheme under the Department of Higher Education, Ministry of Human Resource Development, Government of India. The datasets cover a total of 13 scheduled languages of India, collected in various environments across the length and breadth of the country from a total of 5662 speakers of different age groups, with a total size of more than 1552 hours. The datasets are still growing as we prune them and make them ready for release. Each individual language corpus is usually the largest available at present for that language. Established in 2008 along the lines of the LDC of the University of Pennsylvania, the LDC-IL has worked for over 10 years on various types of language resources, including building the speech corpora. LDC-IL is a fully government-funded project implemented by CIIL, Mysuru. Because of constraints in government business, such as cost analysis and copyright issues, it took rather a long time to release the LDC-IL datasets for public use. This paper gives a brief overview of the raw speech corpora now released and ready for public use (for both commercial and non-commercial purposes). It also discusses how the two major bottlenecks of copyright and costing, which held up the release of these datasets for several years, were addressed.
