A New Language Independent, Photo-realistic Talking Head Driven by Voice Only


Abstract

We propose a new photo-realistic talking head that is driven by voice only, i.e., no linguistic information about the voice input is needed. The core of the new talking head is a context-dependent, multilayer Deep Neural Network (DNN), discriminatively trained on hundreds of hours of speaker-independent speech data. The trained DNN is then used to map acoustic speech input probabilistically to 9,000 tied "senone" states. For each photo-realistic talking head, an HMM-based lip motion synthesizer is trained on the speaker's audio/visual training data, where states are statistically mapped to the corresponding lip images. At test time, for a given speech input, the DNN predicts the likely states with their posterior probabilities, and photo-realistic lip animation is then rendered through the DNN-predicted state lattice. The DNN, trained on English speaker-independent data, has also been tested with input in other languages, e.g., Mandarin and Spanish, to mimic lip movements cross-lingually. Subjective experiments show that the lip motions thus rendered for 15 non-English languages are highly synchronized with the audio input and perceptually photo-realistic to human eyes.
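The flow the abstract describes (frame-level state posteriors, a state lattice, then lip images looked up per state) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the four-state inventory stands in for the roughly 9,000 senones, the hand-written scores stand in for DNN outputs, `stay_bonus` is a made-up smoothness term, and the plain Viterbi search with a state-to-filename table stands in for the HMM-based lip motion synthesizer. It is not the authors' implementation.

```python
import math

def softmax(logits):
    """Turn one frame's raw scores into posterior probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_lattice(posteriors, k):
    """Keep the k most probable states per frame as (state, prob) pairs."""
    lattice = []
    for frame in posteriors:
        ranked = sorted(range(len(frame)), key=lambda s: frame[s], reverse=True)
        lattice.append([(s, frame[s]) for s in ranked[:k]])
    return lattice

def best_path(lattice, stay_bonus=0.5):
    """Viterbi search over the pruned lattice in the log domain.

    stay_bonus (hypothetical) rewards staying in the same state between
    consecutive frames; it is not a parameter from the paper.
    """
    # scores[i] = best log score of any path ending at lattice[t][i]
    scores = [math.log(p) for _, p in lattice[0]]
    backptrs = []
    for t in range(1, len(lattice)):
        new_scores, ptrs = [], []
        for state, prob in lattice[t]:
            cands = [scores[i] + (stay_bonus if prev == state else 0.0)
                     for i, (prev, _) in enumerate(lattice[t - 1])]
            best = max(range(len(cands)), key=lambda i: cands[i])
            new_scores.append(cands[best] + math.log(prob))
            ptrs.append(best)
        scores = new_scores
        backptrs.append(ptrs)
    # Trace back from the best final entry to recover the state sequence.
    idx = max(range(len(scores)), key=lambda i: scores[i])
    path = [lattice[-1][idx][0]]
    for t in range(len(lattice) - 1, 0, -1):
        idx = backptrs[t - 1][idx]
        path.append(lattice[t - 1][idx][0])
    return path[::-1]

# Toy input: 3 frames of scores over 4 hypothetical tied states.
frame_logits = [
    [2.0, 0.1, 0.0, -1.0],
    [1.5, 1.4, 0.0, -1.0],
    [0.2, 2.0, 0.1, -1.0],
]
posteriors = [softmax(f) for f in frame_logits]
lattice = top_k_lattice(posteriors, k=2)
states = best_path(lattice)

# Hypothetical state-to-image table; file names are placeholders.
lip_images = {0: "lips_closed.png", 1: "lips_open.png"}
lip_frames = [lip_images[s] for s in states]
```

Because the search only consumes state posteriors, nothing in this sketch depends on the words or language of the input, which is the property the abstract relies on for cross-lingual use.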


Bibliographic details

Source: Conference of the International Speech Communication Association, 2013, 5 pages
Authors: Xinjian Zhang; Lijuan Wang; Gang Li; Frank Seide; Frank K. Soong
Original format: PDF
Chinese Library Classification: TN912.3-532
Keywords: Language; hundreds; rendered

