Lightweight Real-Time Recurrent Models for Speech Enhancement and Automatic Speech Recognition
dc.contributor.author | Dhahbi, Sami | |
dc.contributor.author | Saleem, Nasir | |
dc.contributor.author | Gunawan, Teddy Surya | |
dc.contributor.author | Bourouis, Sami | |
dc.contributor.author | Ali, Imad | |
dc.contributor.author | Trigui, Aymen | |
dc.contributor.author | Algarni, Abeer D. | |
dc.date | 2024-06 | |
dc.date.accessioned | 2024-05-13T16:16:38Z | |
dc.date.available | 2024-05-13T16:16:38Z | |
dc.identifier.citation | S. Dhahbi, N. Saleem, T. S. Gunawan, S. Bourouis, I. Ali, A. Trigui, A. D. Algarni. Lightweight Real-Time Recurrent Models for Speech Enhancement and Automatic Speech Recognition, International Journal of Interactive Multimedia and Artificial Intelligence, (2024), http://dx.doi.org/10.9781/ijimai.2024.04.003 | es_ES |
dc.identifier.uri | https://reunir.unir.net/handle/123456789/16570 | |
dc.description.abstract | Traditional recurrent neural networks (RNNs) struggle to capture long-term temporal dependencies. Lightweight recurrent models for speech enhancement are therefore important: they must improve noisy speech while remaining computationally efficient and still capturing long-term temporal dependencies. This study proposes a lightweight hourglass-shaped model for speech enhancement (SE) and automatic speech recognition (ASR). Simple recurrent units (SRUs) with skip connections are implemented, and attention gates are added to the skip connections to highlight important features and spectral regions. The model operates without relying on future information, making it well-suited for real-time processing. Combined acoustic features and two training objectives are estimated. Experimental evaluations using short-time objective intelligibility (STOI), perceptual evaluation of speech quality (PESQ), and word error rates (WERs) indicate better intelligibility, perceptual quality, and word recognition rates. The composite measures further confirm reduced residual noise and speech distortion. On the TIMIT database, the proposed model improves STOI by 16.21% and PESQ by 0.69 (31.1%), whereas on the LibriSpeech database it improves STOI by 16.41% and PESQ by 0.71 (32.9%) over noisy speech. Further, our model outperforms other deep neural networks (DNNs) in seen and unseen conditions. ASR performance is measured using the Kaldi toolkit, achieving 15.13% WER in noisy backgrounds. | es_ES |
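The two building blocks the abstract names, a causal simple recurrent unit (SRU) and an attention gate on a skip connection, can be sketched roughly as below. This is an illustrative NumPy sketch under our own assumptions (layer sizes, weight names such as `Wf`/`Wr`/`Ws`, and a simple sigmoid gating form); it is not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SRUCell:
    """Simplified Simple Recurrent Unit (after Lei et al., 2018).

    All matrix multiplies depend only on the current input x_t, so they
    can be batched over time; the per-step recurrence is element-wise,
    which is what makes SRUs cheap enough for real-time processing.
    """
    def __init__(self, d, rng):
        self.W  = rng.standard_normal((d, d)) * 0.1  # candidate transform
        self.Wf = rng.standard_normal((d, d)) * 0.1  # forget gate
        self.Wr = rng.standard_normal((d, d)) * 0.1  # reset/highway gate
        self.bf = np.zeros(d)
        self.br = np.zeros(d)

    def forward(self, xs):
        """xs: array of shape (T, d); returns hidden states (T, d)."""
        c = np.zeros(self.W.shape[0])
        hs = []
        for x in xs:  # causal loop: no future frames are used
            f = sigmoid(self.Wf @ x + self.bf)          # forget gate
            c = f * c + (1.0 - f) * (self.W @ x)        # internal state
            r = sigmoid(self.Wr @ x + self.br)          # highway gate
            hs.append(r * np.tanh(c) + (1.0 - r) * x)   # highway output
        return np.stack(hs)

def attention_gated_skip(skip, dec, Ws, Wd, b):
    """Scale encoder skip features by a sigmoid attention gate computed
    from both the skip features and the decoder-side features, so that
    informative spectral regions are emphasized."""
    gate = sigmoid(skip @ Ws.T + dec @ Wd.T + b)  # values in (0, 1)
    return gate * skip

# Tiny usage example on random "features" (d = 8 dims, T = 5 frames).
rng = np.random.default_rng(0)
cell = SRUCell(8, rng)
hs = cell.forward(rng.standard_normal((5, 8)))
Ws = rng.standard_normal((8, 8)) * 0.1
Wd = rng.standard_normal((8, 8)) * 0.1
gated = attention_gated_skip(hs, hs, Ws, Wd, np.zeros(8))
```

Because the gate is a sigmoid, the gated skip features never exceed the original skip magnitudes; the gate only attenuates uninformative regions.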
dc.language.iso | eng | es_ES |
dc.publisher | International Journal of Interactive Multimedia and Artificial Intelligence (IJIMAI) | es_ES |
dc.relation.ispartofseries | ;vol. 8, nº 6 | |
dc.rights | openAccess | es_ES |
dc.subject | real-time speech | es_ES |
dc.subject | simple recurrent unit (SRU) | es_ES |
dc.subject | speech enhancement | es_ES |
dc.subject | speech processing | es_ES |
dc.subject | speech quality | es_ES |
dc.title | Lightweight Real-Time Recurrent Models for Speech Enhancement and Automatic Speech Recognition | es_ES |
dc.type | article | es_ES |
reunir.tag | ~IJIMAI | es_ES |
dc.identifier.doi | http://dx.doi.org/10.9781/ijimai.2024.04.003 |