The LDC-IL Speech Corpora

Abstract

This paper introduces the first set of speech corpora released in 2019 by the Linguistic Data Consortium for Indian Languages (LDC-IL), a scheme under the Department of Higher Education, Ministry of Human Resource Development, Government of India. The datasets cover 13 scheduled languages of India, collected in various environments across the length and breadth of the country from 5662 speakers of different age groups, and total more than 1552 hours. The datasets are still growing as we prune them and make them ready for release. Each language's corpus is usually the largest available at present for that language. Established in 2008 along the lines of the LDC at the University of Pennsylvania, the LDC-IL has worked for over 10 years on various types of language resources, including building the speech corpora. LDC-IL is a fully government-funded project implemented by CIIL, Mysuru. Owing to constraints in government business, such as cost analysis and copyright issues, it took rather a long time to release the LDC-IL datasets for public use. This paper gives a brief overview of the raw speech corpora now released and ready for public use (for both commercial and non-commercial purposes). It also discusses how the two major bottlenecks of copyright and costing, which held up the release of these datasets for several years, were addressed.

Bibliographic details

Source
Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques | 2020 | pp. 28-32 | 5 pages
Authors
Narayan Choudhary; D. G. Rao
Original format: PDF
Keywords
Government; Education; Linguistics; Internet; Annotations; Statistics; Stakeholders
