Lightweight Real-Time Recurrent Models for Speech Enhancement and Automatic Speech Recognition

Authors:

Dhahbi, Sami; Saleem, Nasir; Gunawan, Teddy Surya; Bourouis, Sami; Ali, Imad; Trigui, Aymen; Algarni, Abeer D.

Date:

06/2024

Keywords:

real-time speech; simple recurrent unit (SRU); speech enhancement; speech processing; speech quality

Journal / publisher:

International Journal of Interactive Multimedia and Artificial Intelligence (IJIMAI)

Citation:

S. Dhahbi, N. Saleem, T. S. Gunawan, S. Bourouis, I. Ali, A. Trigui, A. D. Algarni. Lightweight Real-Time Recurrent Models for Speech Enhancement and Automatic Speech Recognition, International Journal of Interactive Multimedia and Artificial Intelligence, (2024), http://dx.doi.org/10.9781/ijimai.2024.04.003

Item type:

article

URI:

https://reunir.unir.net/handle/123456789/16570

DOI:

http://dx.doi.org/10.9781/ijimai.2024.04.003

Abstract:

Traditional recurrent neural networks (RNNs) have difficulty capturing long-term temporal dependencies. Lightweight recurrent models for speech enhancement therefore need to improve noisy speech while remaining computationally efficient and still capturing long-term temporal dependencies. This study proposes a lightweight hourglass-shaped model for speech enhancement (SE) and automatic speech recognition (ASR). Simple recurrent units (SRUs) with skip connections are used, and attention gates are added to the skip connections to highlight the important features and spectral regions. The model operates without relying on future information, which makes it well suited for real-time processing. Combined acoustic features are used, and two training objectives are estimated. Experimental evaluations using short-time objective intelligibility (STOI), perceptual evaluation of speech quality (PESQ), and word error rate (WER) indicate better intelligibility, perceptual quality, and word recognition rates. Composite measures further confirm the performance in terms of residual noise and speech distortion. On the TIMIT database, the proposed model improves STOI by 16.21% and PESQ by 0.69 (31.1%) over the noisy speech; on the LibriSpeech database, it improves STOI by 16.41% and PESQ by 0.71 (32.9%). Further, the model outperforms other deep neural networks (DNNs) in seen and unseen conditions. ASR performance is measured using the Kaldi toolkit, and the model achieves a WER of 15.13% in noisy backgrounds.
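As an illustration of why a simple recurrent unit suits frame-by-frame processing, the following is a minimal sketch of one commonly published SRU formulation (forget gate, reset gate, and a highway connection to the input). It is not the model described in the abstract: the hourglass layout, the attention gates, and the exact SRU variant are not specified here, and all names (`sru_step`, `run_sru`, `init_params`) and the random weights are hypothetical. The sketch assumes the input and state have the same dimension so that the highway term can be added directly.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, x):
    # Plain matrix-vector product on nested lists.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sru_step(x, c_prev, p):
    """One SRU step. Uses only the current frame x and the previous state."""
    u = matvec(p["W"], x)    # candidate values
    uf = matvec(p["Wf"], x)  # forget-gate pre-activation
    ur = matvec(p["Wr"], x)  # reset-gate pre-activation
    c, h = [], []
    for i in range(len(x)):
        f = sigmoid(uf[i] + p["vf"][i] * c_prev[i] + p["bf"][i])
        r = sigmoid(ur[i] + p["vr"][i] * c_prev[i] + p["br"][i])
        ci = f * c_prev[i] + (1.0 - f) * u[i]   # internal state update
        c.append(ci)
        h.append(r * ci + (1.0 - r) * x[i])     # highway connection to input
    return h, c


def init_params(dim, seed=0):
    # Hypothetical random weights, for illustration only (no training).
    rng = random.Random(seed)

    def mat():
        return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

    def vec():
        return [rng.uniform(-0.5, 0.5) for _ in range(dim)]

    return {"W": mat(), "Wf": mat(), "Wr": mat(),
            "vf": vec(), "vr": vec(),
            "bf": [0.0] * dim, "br": [0.0] * dim}


def run_sru(frames, p):
    """Process frames in order; output t depends only on frames 0..t."""
    c = [0.0] * len(frames[0])
    outputs = []
    for x in frames:
        h, c = sru_step(x, c, p)
        outputs.append(h)
    return outputs


if __name__ == "__main__":
    params = init_params(4)
    frames = [[0.1 * i + 0.01 * j for j in range(4)] for i in range(5)]
    for t, h in enumerate(run_sru(frames, params)):
        print(t, [round(v, 4) for v in h])
```

Because each step reads only the current frame and the previous state, changing a later frame leaves all earlier outputs untouched, which is the property that allows operation without future information.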

Files in this item

Name: Lightweight Real-Time Recurrent Models for Speech Enhancement and Automatic Speech Recognition.pdf

Size: 3.334Mb

Format: application/pdf

This item appears in the following collection(s)

vol. 8, no. 6, June 2024

Year    Views    Downloads
2012    0        0
2013    0        0
2014    0        0
2015    0        0
2016    0        0
2017    0        0
2018    0        0
2019    0        0
2020    0        0
2021    0        0
2022    0        0
2023    0        0
2024    229      190
2025    202      504
2026    64       21