KEYWORDS: 3D modeling, Video, Video coding, Transformers, Video surveillance, Action recognition, Data modeling, Performance modeling, Feature extraction
In the field of video action recognition, a central challenge is extracting rich video features while keeping computation tractable. We propose a novel video action recognition model, 3D ResNet-Transformer, which integrates 3D ResNet (Residual Networks) with a Transformer architecture. Using 3D ResNet as the backbone, our model captures the spatial features of videos through its deep network structure, while the added Transformer encoder layers strengthen spatio-temporal correlations among video features via self-attention, thereby improving recognition accuracy. This design combines the complementary strengths of 3D ResNet and the Transformer. Experimental results on the standard video action recognition datasets HMDB51 and UCF101 demonstrate superior performance: our model improves accuracy by 3.4% and 0.4% over baseline models, achieving TOP-1 accuracies of 82.1% and 97.4%, respectively. These results validate the effectiveness and novelty of the integrated 3D ResNet and Transformer model for enhancing video recognition accuracy.
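To make the architecture concrete, the sketch below pairs a 3D convolutional backbone with a Transformer encoder applied to spatio-temporal feature tokens. This is a minimal PyTorch illustration under stated assumptions, not the authors' implementation: torchvision's r3d_18 stands in for the paper's 3D ResNet, and the number of encoder layers, attention heads, and the token pooling strategy are illustrative choices.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class ResNet3DTransformer(nn.Module):
    """Hypothetical 3D ResNet + Transformer encoder, in the spirit of the paper."""

    def __init__(self, num_classes=101, d_model=512, nhead=8, num_layers=2):
        super().__init__()
        backbone = r3d_18(weights=None)
        # Keep the convolutional stages; drop the final pooling and classifier
        # so the backbone emits a (B, 512, T', H', W') feature volume.
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        # Transformer encoder applies self-attention across all
        # spatio-temporal feature locations (one token per location).
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=2048, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):                        # x: (B, 3, T, H, W)
        f = self.backbone(x)                     # (B, 512, T', H', W')
        tokens = f.flatten(2).transpose(1, 2)    # (B, T'*H'*W', 512)
        tokens = self.encoder(tokens)            # self-attention over tokens
        return self.head(tokens.mean(dim=1))     # mean-pool tokens, classify

# Example: a batch of two 16-frame clips at 112x112 resolution.
model = ResNet3DTransformer(num_classes=101)    # 101 classes, as in UCF101
clip = torch.randn(2, 3, 16, 112, 112)
logits = model(clip)                            # shape: (2, 101)
```

In this sketch each spatio-temporal location of the 3D ResNet feature volume becomes one token, so self-attention can relate features across both space and time before classification; the actual tokenization and pooling used in the paper may differ.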