Journal: Audio, Speech, and Language Processing, IEEE/ACM Transactions on

Learning Structured Sparse Representations for Voice Conversion


Abstract

Sparse-coding techniques for voice conversion assume that an utterance can be decomposed into a sparse code that carries only linguistic content, and a dictionary of atoms that captures the speakers’ characteristics. However, conventional dictionary-construction and sparse-coding algorithms rarely meet this assumption. As a result, the sparse code is no longer speaker-independent, which lowers voice-conversion performance. In this paper, we propose a Cluster-Structured Sparse Representation (CSSR) that improves the speaker independence of the representations. CSSR consists of two complementary components: a Cluster-Structured Dictionary Learning module that groups atoms in the dictionary into clusters, and a Cluster-Selective Objective Function that encourages each speech frame to be represented by atoms from a small number of clusters. We conducted four experiments on the CMU ARCTIC corpus to evaluate the proposed method. In the first, an ablation study, results show that each of the two CSSR components enhances speaker independence, and that combining both components leads to further improvements. In the second experiment, we find that CSSR uses increasingly large dictionaries more efficiently than phoneme-based representations by allowing finer-grained decompositions of speech sounds. In the third experiment, results from objective and subjective measurements show that CSSR outperforms prior voice-conversion methods, improving the acoustic quality of the synthesized speech while retaining the target speaker's voice identity. Finally, we show that CSSR captures latent (i.e., phonetic) information in the speech signal.
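The abstract does not state the CSSR objective itself, so the following is only a rough sketch of the general idea it describes: coding a frame over a dictionary whose atoms are grouped into clusters, with a penalty that favors using few clusters. It assumes a standard group-lasso penalty solved by proximal gradient as a stand-in for the Cluster-Selective Objective Function, and it assumes paired source and target dictionaries that share one code. All names (`cluster_sparse_code`, `convert_frame`, `D_src`, `D_tgt`, `groups`, `lam`) are hypothetical and not taken from the paper.

```python
import numpy as np

def group_soft_threshold(a, groups, tau):
    """Group-lasso proximal step: shrink each cluster's coefficients as a block."""
    out = np.zeros_like(a)
    for idx in groups:
        norm = np.linalg.norm(a[idx])
        if norm > tau:  # clusters with small total energy are zeroed entirely
            out[idx] = (1.0 - tau / norm) * a[idx]
    return out

def cluster_sparse_code(x, D, groups, lam=0.1, n_iter=200):
    """Approximately minimise 0.5*||x - D a||^2 + lam * sum_g ||a_g||_2.

    Assumption: a plain group-lasso penalty, not the paper's actual objective.
    """
    step = 1.0 / np.linalg.norm(D, 2) ** 2  # 1 / Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        a = group_soft_threshold(a - step * grad, groups, step * lam)
    return a

def convert_frame(x_src, D_src, D_tgt, groups, lam=0.1):
    """Code a source frame over D_src, then resynthesise it with D_tgt (assumed paired)."""
    a = cluster_sparse_code(x_src, D_src, groups, lam)
    return D_tgt @ a, a

# Toy data only: 3 clusters of 4 atoms each, 8-dimensional frames.
rng = np.random.default_rng(0)
groups = [np.arange(4 * g, 4 * g + 4) for g in range(3)]
D_src = rng.standard_normal((8, 12))
D_tgt = rng.standard_normal((8, 12))
x_src = D_src[:, groups[1]] @ np.array([0.5, 1.0, 0.2, 0.8])  # frame built from cluster 1
y_tgt, code = convert_frame(x_src, D_src, D_tgt, groups, lam=0.5)
active_clusters = [g for g, idx in enumerate(groups) if np.any(code[idx] != 0)]
```

Raising `lam` drives more clusters to exactly zero, which is the behavior a cluster-selective penalty is meant to encourage; how the paper learns the clustered dictionary is not covered by this sketch.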
