Show simple item record
E2E-V2SResNet: Deep residual convolutional neural networks for end-to-end video driven speech synthesis
dc.contributor.author | Saleem, Nasir | |
dc.contributor.author | Gao, Jiechao | |
dc.contributor.author | Irfan, Muhammad | |
dc.contributor.author | Verdú, Elena | |
dc.contributor.author | Parra Puente, Javier | |
dc.date | 2022 | |
dc.date.accessioned | 2022-10-13T10:48:24Z | |
dc.date.available | 2022-10-13T10:48:24Z | |
dc.identifier.issn | 0262-8856 | |
dc.identifier.uri | https://reunir.unir.net/handle/123456789/13610 | |
dc.description.abstract | Speechreading, which infers a spoken message from visually detected articulatory facial movements, is a challenging task. In this paper, we propose an end-to-end ResNet (E2E-ResNet) model for synthesizing speech signals from silent video of a speaking individual. The model is a convolutional encoder-decoder framework that captures the video frames and encodes them into a latent space of visual features. The decoder outputs spectrograms, which are converted into waveforms corresponding to the speech articulated in the input video. The speech waveforms are then fed to a waveform critic that decides whether the speech is real or synthesized. Experiments show that the proposed E2E-V2SResNet model synthesizes realistic, intelligible speech on the GRID database. To further demonstrate the potential of the proposed model, we also conduct experiments on the TCD-TIMIT database. We evaluate the synthesized speech for unseen speakers using three objective metrics that measure the intelligibility, quality, and word error rate (WER) of the synthesized speech. We show that the E2E-V2SResNet model outperforms the competing approaches in most metrics on the GRID and TCD-TIMIT databases. Compared with the baseline, the proposed model achieves a 3.077% improvement in speech quality and a 2.593% improvement in speech intelligibility. (c) 2022 Elsevier B.V. All rights reserved. | es_ES |
dc.language.iso | eng | es_ES |
dc.publisher | Image and vision computing | es_ES |
dc.relation.ispartofseries | ;vol. 119 | |
dc.relation.uri | https://www.sciencedirect.com/science/article/pii/S026288562200018X?via%3Dihub | es_ES |
dc.rights | openAccess | es_ES |
dc.subject | video processing | es_ES |
dc.subject | E2E speech synthesis | es_ES |
dc.subject | ResNet-18 | es_ES |
dc.subject | residual CNN | es_ES |
dc.subject | waveform CRITIC | es_ES |
dc.subject | JCR | es_ES |
dc.subject | Scopus | es_ES |
dc.title | E2E-V2SResNet: Deep residual convolutional neural networks for end-to-end video driven speech synthesis | es_ES |
dc.type | Indexed journal article | es_ES |
reunir.tag | ~ARI | es_ES |
dc.identifier.doi | https://doi.org/10.1016/j.imavis.2022.104389 |
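The abstract above centers on deep residual convolutional networks. As an illustrative sketch only (not the authors' implementation, whose architecture is described in the linked article), the residual skip connection that gives ResNet-style models their name can be shown in a few lines of NumPy; the filter shapes and the 1-D setting are simplifying assumptions:

```python
import numpy as np

def conv1d(x, w):
    # same-length 1-D convolution (stand-in for a conv layer)
    return np.convolve(x, w, mode="same")

def residual_block(x, w1, w2):
    # two convolutions with ReLU, plus the identity skip connection:
    # output = ReLU(F(x) + x), where F is the learned residual mapping
    h = np.maximum(conv1d(x, w1), 0.0)
    h = conv1d(h, w2)
    return np.maximum(h + x, 0.0)
```

Because of the skip connection, a block whose learned filters are near zero passes its input through almost unchanged, which is what makes very deep stacks of such blocks trainable.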
Files in this item
Files | Size | Format | View |
---|---|---|---|
No files associated with this item. |