Audio-visual multimodal speech recognition method fusing audio and video data

Jian Zhao; Zhiwei Zhang; Yuqing Cui; Wenqian Qiang; Weiqiang Dai; Zhejun Kuang; Lijuan Shi

doi:10.1117/12.2641355

4 August 2022

Proceedings Volume 12306, Second International Conference on Digital Signal and Computer Communications (DSCC 2022); 123060I (2022) https://doi.org/10.1117/12.2641355
Event: Second International Conference on Digital Signal and Computer Communications (DSCC 2022), 2022, Changchun, China

Abstract

With the widespread application of deep learning methods, multimodal techniques have developed rapidly. Because the accuracy of single-modal speech recognition can degrade in noisy environments, multimodal fusion recognition is gradually replacing traditional single-modal recognition methods. In this paper, we first enhance and preprocess the audio and video data, then use LSTM recurrent neural networks for deep feature extraction from the audio and video streams, which effectively addresses the long-term forgetting problem of general neural networks. The audio and video feature vectors are then fused by a fully connected neural network with linear connections. Compared with audio-only speech recognition, this audio-visual fusion method recognizes speech better under noise interference. Compared with traditional audio-visual recognition methods, the model simplifies the recognition process. Recognition experiments on the LRS2-BBC dataset show that the method's recognition accuracy improves somewhat over that of other methods in clean conditions and improves greatly in noisy conditions.
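The pipeline summarized in the abstract (one LSTM encoder per stream, followed by a linear fully connected fusion layer) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no layer sizes, feature types, or training details, so the class names `LSTMEncoder` and `FusionClassifier`, the dimensions (13 audio features, 32 video features, 8 hidden units, 4 output classes), the use of the final hidden state as the stream feature, and the softmax output are all assumptions chosen for illustration. The weights are random and untrained, so the output only demonstrates the data flow.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class LSTMEncoder:
    """Single-layer LSTM returning its final hidden state (hypothetical helper)."""

    GATES = ("input", "forget", "cand", "output")

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        # One weight matrix per gate, acting on the concatenation [x_t, h_{t-1}].
        self.W = {k: rand_matrix(hidden_dim, input_dim + hidden_dim, rng)
                  for k in self.GATES}
        self.b = {k: [0.0] * hidden_dim for k in self.GATES}

    def encode(self, frames):
        h = [0.0] * self.hidden_dim  # hidden state
        c = [0.0] * self.hidden_dim  # cell state carries long-term memory
        for x in frames:
            z = list(x) + h
            pre = {k: [a + b for a, b in zip(matvec(self.W[k], z), self.b[k])]
                   for k in self.GATES}
            i = [sigmoid(v) for v in pre["input"]]
            f = [sigmoid(v) for v in pre["forget"]]
            g = [math.tanh(v) for v in pre["cand"]]
            o = [sigmoid(v) for v in pre["output"]]
            # The forget gate scales the old cell state; the input gate admits new content.
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h


class FusionClassifier:
    """Two stream encoders whose features are concatenated and fused linearly."""

    def __init__(self, audio_dim, video_dim, hidden_dim, num_classes, seed=0):
        rng = random.Random(seed)
        self.audio_enc = LSTMEncoder(audio_dim, hidden_dim, rng)
        self.video_enc = LSTMEncoder(video_dim, hidden_dim, rng)
        self.W_out = rand_matrix(num_classes, 2 * hidden_dim, rng)
        self.b_out = [0.0] * num_classes

    def forward(self, audio_frames, video_frames):
        # Concatenate the two stream features, then apply one linear layer.
        fused = self.audio_enc.encode(audio_frames) + self.video_enc.encode(video_frames)
        logits = [s + b for s, b in zip(matvec(self.W_out, fused), self.b_out)]
        m = max(logits)  # subtract the max for a numerically stable softmax
        exps = [math.exp(v - m) for v in logits]
        total = sum(exps)
        return [e / total for e in exps]


# Synthetic inputs: 20 audio frames of 13 features, 5 video frames of 32 features.
data_rng = random.Random(1)
audio = [[data_rng.gauss(0, 1) for _ in range(13)] for _ in range(20)]
video = [[data_rng.gauss(0, 1) for _ in range(32)] for _ in range(5)]

model = FusionClassifier(audio_dim=13, video_dim=32, hidden_dim=8, num_classes=4)
probs = model.forward(audio, video)
print(probs)
```

Because each stream is reduced to a fixed-size vector before fusion, the audio and video sequences may have different lengths and frame rates, as in the synthetic inputs above.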

Citation

Jian Zhao, Zhiwei Zhang, Yuqing Cui, Wenqian Qiang, Weiqiang Dai, Zhejun Kuang, and Lijuan Shi "Audio-visual multimodal speech recognition method fusing audio and video data", Proc. SPIE 12306, Second International Conference on Digital Signal and Computer Communications (DSCC 2022), 123060I (4 August 2022); https://doi.org/10.1117/12.2641355

PROCEEDINGS
5 PAGES

KEYWORDS

Neural networks, Feature extraction, Video, Speech recognition, Data modeling, Visualization, 3D modeling
