Cross-modal fusion and temporal enhancement for egocentric action recognition

Dengdi Sun; Xueliang Zhang; Bin Luo; Zhuanlian Ding

doi:10.1117/12.3033627

13 June 2024

Proceedings Volume 13180, International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2024); 131805S (2024) https://doi.org/10.1117/12.3033627
Event: International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2024), 2024, Guangzhou, China

Abstract

Egocentric action recognition is an important research topic in video understanding that focuses on parsing and understanding the actions of people in videos. In this paper, we propose a novel egocentric action recognition method that mines the textual information contained in the action labels and tightly couples it with the video data, capturing behavior patterns more comprehensively and in finer detail. By incorporating a two-stream network architecture over RGB frames and optical flow, our method captures the temporal dynamics of videos more accurately, thereby improving the spatio-temporal representation of our model for egocentric action recognition. To verify the effectiveness and generality of the proposed method, we performed experiments on several commonly used egocentric datasets (such as EGTEA and GTEA Gaze+). The results demonstrate that, compared with state-of-the-art methods, our method achieves significant improvements in both key metrics: accuracy and average class accuracy.
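The two metrics named in the abstract can be illustrated with a minimal sketch. It assumes that "average class accuracy" follows the common definition (the mean of per-class accuracies, so that rare actions count as much as frequent ones); the abstract itself does not define the term. The function names and the toy labels below are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of clips whose predicted label equals the ground-truth label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def mean_class_accuracy(y_true, y_pred):
    """Mean of per-class accuracies (assumed definition, see lead-in).

    Each ground-truth class contributes equally, regardless of how many
    clips it has.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    hits = defaultdict(int)    # correctly classified clips per true class
    totals = defaultdict(int)  # total clips per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Hypothetical, imbalanced toy labels: "cut" dominates the test set.
y_true = ["cut", "cut", "cut", "cut", "pour", "stir"]
y_pred = ["cut", "cut", "cut", "cut", "cut", "stir"]

overall = accuracy(y_true, y_pred)               # 5 of 6 clips correct
per_class = mean_class_accuracy(y_true, y_pred)  # (1.0 + 0.0 + 1.0) / 3
print(f"accuracy={overall:.3f}, mean class accuracy={per_class:.3f}")
```

On imbalanced data the two numbers can diverge: here the single "pour" clip is missed, which lowers accuracy only slightly but removes a full third of the mean class accuracy. This is one reason both metrics are usually reported together.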

(2024) Published by SPIE. Downloading of the abstract is permitted for personal use only.

Dengdi Sun, Xueliang Zhang, Bin Luo, and Zhuanlian Ding "Cross-modal fusion and temporal enhancement for egocentric action recognition", Proc. SPIE 13180, International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2024), 131805S (13 June 2024); https://doi.org/10.1117/12.3033627

PROCEEDINGS
7 PAGES

KEYWORDS

Video

Action recognition

Feature fusion

Optical flow

Transformers

Video coding

Video processing
