Show simple item record

dc.contributor.author       Saleem, Nasir
dc.contributor.author       Gao, Jiechao
dc.contributor.author       Irfan, Muhammad
dc.contributor.author       Verdú, Elena
dc.contributor.author       Parra Puente, Javier
dc.date                     2022
dc.date.accessioned         2022-10-13T10:48:24Z
dc.date.available           2022-10-13T10:48:24Z
dc.identifier.issn          0262-8856
dc.identifier.uri           https://reunir.unir.net/handle/123456789/13610
dc.description.abstract     Speechreading, which infers a spoken message from visually detected articulatory facial movements, is a challenging task. In this paper, we propose an end-to-end ResNet (E2E-ResNet) model for synthesizing speech signals from silent video of a speaking individual. The model is a convolutional encoder-decoder framework that captures the video frames and encodes them into a latent space of visual features. The decoder outputs spectrograms, which are converted into waveforms corresponding to the speech articulated in the input video. The speech waveforms are then fed to a waveform critic that decides whether the speech is real or synthesized. The experiments show that the proposed E2E-V2SResNet model synthesizes realistic and intelligible speech of good quality on the GRID database. To further demonstrate the potential of the proposed model, we also conduct experiments on the TCD-TIMIT database. We examine the synthesized speech for unseen speakers using three objective metrics that measure the intelligibility, quality, and word error rate (WER) of the synthesized speech. We show that the E2E-V2SResNet model outscores the competing approaches on most metrics on the GRID and TCD-TIMIT databases. Compared with the baseline, the proposed model achieves a 3.077% improvement in speech quality and a 2.593% improvement in speech intelligibility. (c) 2022 Elsevier B.V. All rights reserved.   es_ES
dc.language.iso             eng   es_ES
dc.publisher                Image and vision computing   es_ES
dc.relation.ispartofseries  ;vol. 119
dc.relation.uri             https://www.sciencedirect.com/science/article/pii/S026288562200018X?via%3Dihub   es_ES
dc.rights                   openAccess   es_ES
dc.subject                  video processing   es_ES
dc.subject                  E2E speech synthesis   es_ES
dc.subject                  ResNet-18   es_ES
dc.subject                  residual CNN   es_ES
dc.subject                  waveform CRITIC   es_ES
dc.subject                  JCR   es_ES
dc.subject                  Scopus   es_ES
dc.title                    E2E-V2SResNet: Deep residual convolutional neural networks for end-to-end video driven speech synthesis   es_ES
dc.type                     Articulo Revista Indexada   es_ES
reunir.tag                  ~ARI   es_ES
dc.identifier.doi           https://doi.org/10.1016/j.imavis.2022.104389
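
The abstract above describes a video-to-speech pipeline: a per-frame residual convolutional encoder mapping silent video to a latent space, a decoder producing spectrograms, and a waveform critic judging real versus synthesized speech. The following minimal PyTorch sketch only illustrates that general structure; the module names (VideoEncoder, SpecDecoder, WaveformCritic), layer sizes, and hyperparameters are assumptions for illustration and are not the authors' E2E-V2SResNet implementation.

# Illustrative sketch of a video-to-spectrogram encoder-decoder with a waveform
# critic, loosely following the abstract above. All names and sizes are assumed.
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Basic 2-D residual block (ResNet-18 style), applied per video frame."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))


class VideoEncoder(nn.Module):
    """Encodes each frame of a silent talking-face video into a latent vector."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            ResidualBlock(64),
            ResidualBlock(64),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, latent_dim)

    def forward(self, video):                      # video: (B, T, 3, H, W)
        b, t, c, h, w = video.shape
        feats = self.features(video.reshape(b * t, c, h, w)).flatten(1)
        return self.proj(feats).reshape(b, t, -1)  # (B, T, latent_dim)


class SpecDecoder(nn.Module):
    """Decodes the latent frame sequence into mel-spectrogram frames."""
    def __init__(self, latent_dim=256, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(latent_dim, 256, batch_first=True)
        self.out = nn.Linear(256, n_mels)

    def forward(self, z):                          # z: (B, T, latent_dim)
        h, _ = self.rnn(z)
        return self.out(h)                         # (B, T, n_mels)


class WaveformCritic(nn.Module):
    """1-D convolutional critic scoring waveforms as real or synthesized."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, 15, stride=4), nn.LeakyReLU(0.2),
            nn.Conv1d(32, 64, 15, stride=4), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, 1),
        )

    def forward(self, wav):                        # wav: (B, 1, samples)
        return self.net(wav)


if __name__ == "__main__":
    video = torch.randn(2, 75, 3, 64, 64)          # 2 clips, 75 frames each
    spec = SpecDecoder()(VideoEncoder()(video))    # (2, 75, 80) mel frames
    print(spec.shape)

In this kind of setup, the spectrogram output would be converted to a waveform (e.g. with a vocoder or Griffin-Lim) before being scored by the critic; that conversion step is omitted here for brevity.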


Files in this item


There are no files associated with this item.

This item appears in the following collection(s)
