Gabor-based audiovisual fusion for Mandarin Chinese speech recognition

Xu, Yan and Wang, Hongce and Dong, Zhongping and Li, Yuexuan and Abel, Andrew; (2022) Gabor-based audiovisual fusion for Mandarin Chinese speech recognition. In: 2022 30th European Signal Processing Conference (EUSIPCO). IEEE, Belgrade, Serbia, pp. 603-607. ISBN 9789082797091 (https://doi.org/10.23919/eusipco55093.2022.9909634)


Abstract

Audiovisual Speech Recognition (AVSR) is a popular research topic, and incorporating visual features into speech recognition systems has been found to deliver good results. In recent years, end-to-end Convolutional Neural Network (CNN) based deep learning has been widely utilized. However, such systems often require large amounts of training data and can be time-consuming to train. Much speech research also focuses on English-language datasets. In this paper, we propose a lightweight, optimized, and automated speech recognition system using Gabor-based feature extraction, combined with our Audiovisual Mandarin Chinese (AVMC) corpus. This combines Mel-frequency Cepstral Coefficients (MFCCs) with a CNN-Bidirectional Long Short-Term Memory (BiLSTM)-Attention (CLA) model for audio speech recognition, and low-dimensional Gabor visual features with a CLA model for visual speech recognition. As we focus on Chinese language recognition, we individually analyse initials, finals, and tones as components of pinyin speech production. The proposed low-dimensionality system achieves 88.6%, 87.5%, and 93.6% accuracy for tones, initials, and finals respectively at the character level, and 84.8% for pinyin at the word level.
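
The abstract describes a CLA (CNN-BiLSTM-Attention) classifier applied to MFCC audio features and low-dimensional Gabor visual features. The sketch below is a minimal illustration of that kind of architecture in PyTorch; the layer sizes, the additive attention formulation, the 39-dimensional MFCC input, and the class count are assumptions chosen for illustration, not the configuration reported in the paper.

```python
# A minimal sketch of a CNN-BiLSTM-Attention (CLA) classifier in PyTorch.
# All hyperparameters here are illustrative assumptions, not the paper's settings.
import torch
import torch.nn as nn


class CLA(nn.Module):
    """1D CNN over feature frames -> BiLSTM -> additive attention -> class logits."""

    def __init__(self, n_features=39, n_classes=23, cnn_channels=64, lstm_hidden=128):
        super().__init__()
        # CNN front end: convolve over time, with feature dimensions as input channels.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_features, cnn_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # Bidirectional LSTM over the CNN output sequence.
        self.bilstm = nn.LSTM(cnn_channels, lstm_hidden,
                              batch_first=True, bidirectional=True)
        # Additive attention: score each time step, pool with softmax weights.
        self.attn = nn.Linear(2 * lstm_hidden, 1)
        self.classifier = nn.Linear(2 * lstm_hidden, n_classes)

    def forward(self, x):                        # x: (batch, time, n_features)
        h = self.cnn(x.transpose(1, 2))          # (batch, channels, time')
        h, _ = self.bilstm(h.transpose(1, 2))    # (batch, time', 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)   # (batch, time', 1) attention weights
        ctx = (w * h).sum(dim=1)                 # attention-weighted sequence summary
        return self.classifier(ctx)              # (batch, n_classes) logits


# Example: a batch of 4 utterances, each 100 frames of 39-dim MFCC-style features.
logits = CLA()(torch.randn(4, 100, 39))
print(logits.shape)  # torch.Size([4, 23])
```

The same classifier shape could in principle be fed either acoustic MFCC frames or a sequence of low-dimensional Gabor-based visual features, with separate models trained per target (tones, initials, finals), which is consistent with the per-category results quoted in the abstract.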