E2E-V2SResNet: Deep residual convolutional neural networks for end-to-end video driven speech synthesis

Saleem, Nasir; Gao, Jiechao; Irfan, Muhammad; Verdú, Elena; Parra Puente, Javier

dc.contributor.author	Saleem, Nasir
dc.contributor.author	Gao, Jiechao
dc.contributor.author	Irfan, Muhammad
dc.contributor.author	Verdú, Elena
dc.contributor.author	Parra Puente, Javier
dc.date	2022
dc.date.accessioned	2022-10-13T10:48:24Z
dc.date.available	2022-10-13T10:48:24Z
dc.identifier.issn	0262-8856
dc.identifier.uri	https://reunir.unir.net/handle/123456789/13610
dc.description.abstract	Speechreading, which infers a spoken message from visually detected articulatory facial movements, is a challenging task. In this paper, we propose an end-to-end ResNet (E2E-ResNet) model for synthesizing speech signals from the silent video of a speaking individual. The model is a convolutional encoder-decoder framework that takes the video frames and encodes them into a latent space of visual features. The decoder outputs spectrograms, which are converted into waveforms corresponding to the speech articulated in the input video. The speech waveforms are then fed to a waveform critic that decides whether the speech is real or synthesized. The experiments show that the proposed E2E-V2SResNet model is able to synthesize speech with realism, intelligibility, and quality on the GRID database. To further demonstrate the potential of the proposed model, we also conduct experiments on the TCD-TIMIT database. We evaluate the synthesized speech for unseen speakers using three objective metrics that measure intelligibility, quality, and word error rate (WER). We show that the E2E-V2SResNet model outperforms the competing approaches in most metrics on the GRID and TCD-TIMIT databases. Compared with the baseline, the proposed model achieves a 3.077% improvement in speech quality and a 2.593% improvement in speech intelligibility. (c) 2022 Elsevier B.V. All rights reserved.	es_ES
dc.language.iso	eng	es_ES
dc.publisher	Image and Vision Computing	es_ES
dc.relation.ispartofseries	;vol. 119
dc.relation.uri	https://www.sciencedirect.com/science/article/pii/S026288562200018X?via%3Dihub	es_ES
dc.rights	openAccess	es_ES
dc.subject	video processing	es_ES
dc.subject	E2E speech synthesis	es_ES
dc.subject	ResNet-18	es_ES
dc.subject	residual CNN	es_ES
dc.subject	waveform CRITIC	es_ES
dc.subject	JCR	es_ES
dc.subject	Scopus	es_ES
dc.title	E2E-V2SResNet: Deep residual convolutional neural networks for end-to-end video driven speech synthesis	es_ES
dc.type	Indexed journal article	es_ES
reunir.tag	~ARI	es_ES
dc.identifier.doi	https://doi.org/10.1016/j.imavis.2022.104389

Files in this item

Files	Size	Format	View
There are no files associated with this item.

This item appears in the following collection(s)

Artículos Científicos WOS y SCOPUS

Related items

Regularized sparse features for noisy speech enhancement using deep neural networks

On improvement of speech intelligibility and quality: a survey of unsupervised single channel speech enhancement algorithms

On Improvement of Speech Intelligibility and Quality: A Survey of Unsupervised Single Channel Speech Enhancement Algorithms
