Cross-modal fusion and temporal enhancement for egocentric action recognition
13 June 2024
Dengdi Sun, Xueliang Zhang, Bin Luo, Zhuanlian Ding
Proceedings Volume 13180, International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2024); 131805S (2024) https://doi.org/10.1117/12.3033627
Event: International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2024), 2024, Guangzhou, China
Abstract
Egocentric action recognition is an important research topic in video understanding that focuses on parsing and understanding the actions of people in videos. In this paper, we propose a novel egocentric action recognition method that mines the text information contained in the action labels and tightly combines it with the video data, capturing behavior patterns more comprehensively and in finer detail. By incorporating a two-stream network architecture over RGB and optical flow, our method captures the temporal dynamics in videos more accurately, thereby improving the spatio-temporal representation of our model for egocentric action recognition. To verify the effectiveness and generality of the proposed method, we performed experiments on several commonly used first-person datasets (such as EGTEA and GTEA Gaze+). The results demonstrate that, compared with state-of-the-art methods, our method achieves significant improvements in both key metrics: accuracy and average class accuracy.
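The two metrics named in the abstract can be illustrated with a minimal sketch. It assumes the usual definitions: accuracy is the fraction of clips classified correctly, and average class accuracy is the unweighted mean of per-class accuracy (per-class recall). The function names and the toy action labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the ground truth."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def mean_class_accuracy(y_true, y_pred):
    """Unweighted mean of per-class accuracy over classes present in y_true."""
    total = defaultdict(int)    # samples per ground-truth class
    correct = defaultdict(int)  # correctly predicted samples per class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical labels: a frequent class ("cut") and two rare ones.
y_true = ["cut", "cut", "cut", "cut", "pour", "open"]
y_pred = ["cut", "cut", "cut", "cut", "cut", "open"]

print(accuracy(y_true, y_pred))             # 5 of 6 clips correct
print(mean_class_accuracy(y_true, y_pred))  # mean of 1.0, 0.0, 1.0
```

On class-imbalanced data the two numbers can diverge, as in this toy case where the single missed "pour" clip costs little accuracy but a full class in the class average. That is one reason results are often reported with both.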
(2024) Published by SPIE. Downloading of the abstract is permitted for personal use only.
Dengdi Sun, Xueliang Zhang, Bin Luo, and Zhuanlian Ding "Cross-modal fusion and temporal enhancement for egocentric action recognition", Proc. SPIE 13180, International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2024), 131805S (13 June 2024); https://doi.org/10.1117/12.3033627
KEYWORDS
Video
Action recognition
Feature fusion
Optical flow
Transformers
Video coding
Video processing
