    E2E-V2SResNet: Deep residual convolutional neural networks for end-to-end video driven speech synthesis

    Author: Saleem, Nasir; Gao, Jiechao; Irfan, Muhammad; Verdú, Elena; Parra Puente, Javier
    Date: 2022
    Keywords: video processing; E2E speech synthesis; ResNet-18; residual CNN; waveform critic; JCR; Scopus
    Journal / publisher: Image and Vision Computing
    Item type: Indexed journal article
    URI: https://reunir.unir.net/handle/123456789/13610
    DOI: https://doi.org/10.1016/j.imavis.2022.104389
    Web address: https://www.sciencedirect.com/science/article/pii/S026288562200018X?via%3Dihub
    Open Access
    Abstract:
    Speechreading, which infers the spoken message from visually detected articulatory facial movements, is a challenging task. In this paper, we propose an end-to-end ResNet (E2E-ResNet) model for synthesizing speech signals from silent video of a speaking individual. The model is a convolutional encoder-decoder framework that captures the frames of a video and encodes them into a latent space of visual features. The decoder outputs spectrograms, which are converted into waveforms corresponding to the speech articulated in the input video. The speech waveforms are then fed to a waveform critic that decides whether the speech is real or synthesized. Experiments show that the proposed E2E-V2SResNet model synthesizes realistic, intelligible speech for the GRID database. To further demonstrate the potential of the proposed model, we also conduct experiments on the TCD-TIMIT database. We examine the synthesized speech for unseen speakers using three objective metrics that measure the intelligibility, quality, and word error rate (WER) of the synthesized speech. We show that the E2E-V2SResNet model outscores the competing approaches in most metrics on the GRID and TCD-TIMIT databases. Compared with the baseline, the proposed model achieves a 3.077% improvement in speech quality and a 2.593% improvement in speech intelligibility. (c) 2022 Elsevier B.V. All rights reserved.
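The data flow the abstract describes (video frames → convolutional encoder → latent visual features → decoder → spectrogram → waveform critic) can be sketched as follows. This is a minimal illustrative sketch only: the dimensions, the linear "layers", and the clip length are assumptions for demonstration, not the paper's actual ResNet-18 architecture.

```python
import numpy as np

# Hypothetical sketch of the video-to-speech pipeline from the abstract.
# All sizes and weights below are illustrative placeholders.
rng = np.random.default_rng(0)

T, H, W = 75, 48, 96          # assumed: 3 s GRID clip at 25 fps, mouth-region frames
latent_dim, n_mels = 256, 80  # assumed latent and mel-spectrogram sizes

frames = rng.standard_normal((T, H, W))  # silent video of a speaking individual

# "Encoder": flatten each frame and project it into the latent visual-feature space.
W_enc = rng.standard_normal((H * W, latent_dim)) * 0.01
latent = np.maximum(frames.reshape(T, -1) @ W_enc, 0.0)  # ReLU, shape (T, latent_dim)

# "Decoder": map latent features to a mel-spectrogram, one spectral frame per video frame.
W_dec = rng.standard_normal((latent_dim, n_mels)) * 0.01
spectrogram = latent @ W_dec                              # shape (T, n_mels)

# "Waveform critic": reduces a clip to a scalar score for deciding real vs. synthesized.
w_critic = rng.standard_normal(n_mels) * 0.01
score = float(spectrogram.mean(axis=0) @ w_critic)

print(spectrogram.shape, score)
```

In the actual model, the placeholder projections are residual convolutional blocks and the spectrogram is vocoded into a waveform before reaching the critic; the sketch only shows how the tensor shapes connect the stages.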
    Show full item record
    This item appears in the following collection(s)
    • Artículos Científicos WOS y SCOPUS

    Usage statistics

    Year       2012–2021  2022  2023  2024  2025
    Views      0          12    51    115   150
    Downloads  0          0     0     0     0

    Related items

    Showing items related by title, author or subject.

    • Regularized sparse features for noisy speech enhancement using deep neural networks 

      Khattak, Muhammad Irfan; Saleem, Nasir; Gao, Jiechao; Verdú, Elena; Parra Fuente, Javier (Computers and Electrical Engineering, 2022)
      A speech enhancement algorithm improves the perceptual aspects of a speech degraded by noise signals. We propose a phase-aware deep neural network (DNN) using the regularized sparse features for speech enhancement. A ...
    • On improvement of speech intelligibility and quality: a survey of unsupervised single channel speech enhancement algorithms 

      Saleem, Nasir; Khattak, Muhammad Irfan; Verdú, Elena (International Journal of Interactive Multimedia and Artificial Intelligence, 06/2020)
      Many forms of human communication exist; for instance, text and nonverbal based. Speech is, however, the most powerful and dexterous form for the humans. Speech signals enable humans to communicate and this usefulness of ...
    • On Improvement of Speech Intelligibility and Quality: A Survey of Unsupervised Single Channel Speech Enhancement Algorithms 

      Saleem, Nasir; Khattak, Muhammad Irfan; Verdú, Elena (International Journal of Interactive Multimedia and Artificial Intelligence (IJIMAI), 06/2020)
      Many forms of human communication exist; for instance, text and nonverbal based. Speech is, however, the most powerful and dexterous form for the humans. Speech signals enable humans to communicate and this usefulness of ...

    Legal Notice · Privacy Policy · Cookies Policy · GDPR Legal Clauses
    © UNIR - Universidad Internacional de La Rioja
     