Audio-visual multimodal speech recognition method fusing audio and video data
4 August 2022
Jian Zhao, Zhiwei Zhang, Yuqing Cui, Wenqian Qiang, Weiqiang Dai, Zhejun Kuang, Lijuan Shi
Proceedings Volume 12306, Second International Conference on Digital Signal and Computer Communications (DSCC 2022); 123060I (2022) https://doi.org/10.1117/12.2641355
Event: Second International Conference on Digital Signal and Computer Communications (DSCC 2022), 2022, Changchun, China
Abstract
With the widespread application of deep learning methods, multimodal techniques have developed rapidly. Because the accuracy of single-modal speech recognition can degrade in noisy environments, multimodal fusion recognition is gradually replacing traditional single-modal recognition methods. In this paper, we first enhance and pre-process the audio and video data, then use an LSTM recurrent neural network to extract deep features from the audio and video streams, which effectively addresses the long-term forgetting problem of general neural networks. The audio and video feature vectors are then fused by a fully connected neural network with linear connections. Compared with audio-only speech recognition, this audio-visual fusion method recognizes speech better under noise interference. Compared with traditional audio-visual recognition methods, the model simplifies the recognition work. Recognition experiments on the LRS2-BBC dataset show that the recognition accuracy of this method improves to a certain extent over that of other methods in a clean environment and improves greatly in noisy conditions.
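The pipeline described in the abstract (one LSTM per stream, then a fully connected fusion layer) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not give layer sizes, feature dimensions, or the output layer, so the dimensions, the single-layer LSTM, the use of the final hidden state as the stream feature, the softmax output, and the names `LSTMEncoder` and `fuse_and_classify` are all assumptions made for illustration. The weights are random and untrained, so only the data flow is meaningful.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class LSTMEncoder:
    """Single-layer LSTM that returns its final hidden state (assumed design)."""

    def __init__(self, input_dim, hidden_dim, rng):
        scale = 1.0 / np.sqrt(hidden_dim)
        # One stacked weight matrix for the four gates: input, forget, cell, output.
        self.W = rng.uniform(-scale, scale, (4 * hidden_dim, input_dim + hidden_dim))
        self.b = np.zeros(4 * hidden_dim)
        self.hidden_dim = hidden_dim

    def __call__(self, seq):
        h = np.zeros(self.hidden_dim)  # hidden state
        c = np.zeros(self.hidden_dim)  # cell state, carries long-term memory
        for x_t in seq:
            z = self.W @ np.concatenate([x_t, h]) + self.b
            i, f, g, o = np.split(z, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
        return h


def fuse_and_classify(audio_feat, video_feat, W_fc, b_fc):
    """Concatenate both stream features and apply one linear layer plus softmax."""
    fused = np.concatenate([audio_feat, video_feat])
    logits = W_fc @ fused + b_fc
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()


# Hypothetical sizes; the abstract does not specify any of these.
AUDIO_DIM, VIDEO_DIM, HIDDEN, N_CLASSES = 13, 32, 16, 10

rng = np.random.default_rng(0)
audio_encoder = LSTMEncoder(AUDIO_DIM, HIDDEN, rng)
video_encoder = LSTMEncoder(VIDEO_DIM, HIDDEN, rng)
W_fc = rng.normal(0.0, 0.1, (N_CLASSES, 2 * HIDDEN))
b_fc = np.zeros(N_CLASSES)

# Random stand-ins for pre-processed audio frames and video (lip-region) frames.
audio_seq = rng.normal(size=(50, AUDIO_DIM))
video_seq = rng.normal(size=(25, VIDEO_DIM))

probs = fuse_and_classify(audio_encoder(audio_seq), video_encoder(video_seq), W_fc, b_fc)
print(probs.shape, probs.sum())
```

Because each stream is summarized by its own encoder before fusion, the two sequences may have different lengths and frame rates, as they do in this example. Fusing by concatenation followed by a linear layer is only one plausible reading of "a fully connected neural network with linear connections".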
© (2022) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Jian Zhao, Zhiwei Zhang, Yuqing Cui, Wenqian Qiang, Weiqiang Dai, Zhejun Kuang, and Lijuan Shi "Audio-visual multimodal speech recognition method fusing audio and video data", Proc. SPIE 12306, Second International Conference on Digital Signal and Computer Communications (DSCC 2022), 123060I (4 August 2022); https://doi.org/10.1117/12.2641355
KEYWORDS
Neural networks
Feature extraction
Video
Speech recognition
Data modeling
Visualization
3D modeling