Method and system for creating frugal speech corpus using internet resources and conventional speech corpus

Abstract

A speech corpus creation method and system are disclosed. The method comprises: identifying a publicly accessible first source of first speech data and its corresponding first text transcription; extracting second speech data in an accessible encoding format from the first speech data; extracting second text transcription data in at least one encoding format from the first text transcription; matching and aligning the second text transcription to the extracted second speech data at the sentence, word, or phoneme level, or a combination thereof, to form a first and a second speech corpus; analyzing the text transcriptions in the second speech corpus to identify short speech segments and produce a phonetically balanced, segmented, text-aligned third speech corpus; and conditioning the third speech corpus by inserting into it a corpus richer in context and associated environment, drawn from at least one second source, to form the final speech corpus.
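The abstract does not say how the short segments are chosen to make the third corpus phonetically balanced. One common approach, offered here only as an illustrative sketch and not as the patented procedure, is a greedy selection that repeatedly picks the short segment covering the most phonemes that are still under-represented. The function name `select_balanced`, the `duration` and `phonemes` fields, and the `max_duration` and `target_per_phoneme` parameters are all hypothetical and assume each aligned segment already carries a phoneme sequence.

```python
from collections import Counter


def select_balanced(segments, max_duration, target_per_phoneme=1):
    """Greedily pick short segments that improve phoneme coverage.

    `segments` is a list of dicts with hypothetical keys:
      "id"       - any identifier for the aligned segment
      "duration" - segment length in seconds
      "phonemes" - list of phoneme symbols from the alignment step
    Returns the chosen segments and the resulting phoneme counts.
    """
    counts = Counter()
    chosen = []
    # Keep only segments short enough to qualify.
    candidates = [s for s in segments if s["duration"] <= max_duration]

    def gain(seg):
        # Number of distinct phonemes in this segment still below target.
        return sum(1 for p in set(seg["phonemes"]) if counts[p] < target_per_phoneme)

    while candidates:
        best = max(candidates, key=gain)
        if gain(best) == 0:
            break  # No remaining segment adds coverage.
        chosen.append(best)
        counts.update(best["phonemes"])
        candidates.remove(best)
    return chosen, counts
```

Called with a list of aligned segments and a duration cap, the function returns a subset that covers each observed phoneme at least `target_per_phoneme` times where the data allows. A greedy pass is simple and fast but does not guarantee the smallest possible subset.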
