Source: Chinese Spoken Language Processing; Lecture Notes in Artificial Intelligence; 4274

HKUST/MTS: A Very Large Scale Mandarin Telephone Speech Corpus



Abstract

The paper describes the design, collection, transcription and analysis of the HKUST Mandarin Telephone Speech Corpus (HKUST/MTS): 200 hours of speech from over 2100 Mandarin speakers in mainland China, collected under the DARPA EARS framework. The corpus includes speech data, transcriptions and speaker demographic information. The speech data comprise 1206 ten-minute natural Mandarin conversations between either strangers or friends. Each conversation focuses on a single topic. All calls are recorded over public telephone networks and manually annotated with standard Chinese characters (GBK encoding) as well as specific mark-ups for spontaneous speech. A file with speaker demographic information is also provided. The corpus is the largest and the first of its kind for Mandarin conversational telephone speech, providing abundant and diverse samples for Mandarin speech recognition and other application-dependent tasks such as topic detection, information retrieval, keyword spotting and speaker recognition. In a 2004 evaluation by NIST, the corpus was found to improve system performance significantly.
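As a quick consistency check on the figures quoted in the abstract, the sketch below multiplies the number of conversations by their nominal length. It assumes every conversation lasts exactly ten minutes, which is a simplification; the function name `corpus_hours` is made up for illustration and is not part of any corpus tooling.

```python
# Figures taken from the abstract.
NUM_CONVERSATIONS = 1206
MINUTES_PER_CONVERSATION = 10  # nominal length (assumption: real calls may vary)


def corpus_hours(num_conversations: int, minutes_each: float) -> float:
    """Return the total nominal duration in hours."""
    return num_conversations * minutes_each / 60


total_hours = corpus_hours(NUM_CONVERSATIONS, MINUTES_PER_CONVERSATION)
print(f"{total_hours:.1f} hours")  # 201.0 hours
```

The result, about 201 hours, agrees with the "200 hours" stated for the corpus.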
