Video action recognition is a vital task in computer vision. A great deal of redundant information is generated alongside the original video data during deep network computation. Most existing methods address this problem by improving recognition speed at the cost of recognition accuracy. In this paper, we propose a new framework, Fastformer: a transformer-based architecture for fast-inference video classification that further improves inference speed while maintaining accuracy. To balance speed and accuracy, we reduce the inter-frame and intra-frame redundancy of video and design a new self-attention network that uses an improved highway network to realize the same function as a traditional self-attention module while greatly reducing both the computation and the number of parameters required. We conduct experiments to verify the effectiveness of our model. Overall, Fastformer significantly outperforms existing vision transformers with regard to the speed versus accuracy trade-off. For example, at 76.4% Kinetics-400 accuracy, Fastformer is 28% faster than TimeSformer.
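The abstract does not detail how the improved highway network substitutes for self-attention, so the following is only a minimal sketch of the general idea, assuming the classic highway gating of Srivastava et al. rather than the authors' actual design; the class name HighwayTokenMixer and all layer choices are hypothetical. The point of the sketch is the cost argument: a per-token highway block costs O(N·d²) for N tokens of width d, versus O(N²·d) for full self-attention, which is where such a replacement saves computation.

```python
import torch
import torch.nn as nn

class HighwayTokenMixer(nn.Module):
    """Hypothetical highway-style stand-in for a self-attention block.

    Classic highway formulation, applied per token:
        y = g * H(x) + (1 - g) * x,   g = sigmoid(W_g x + b_g)
    This avoids the O(N^2 * d) pairwise token interactions of
    standard self-attention over N tokens of dimension d.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Linear(dim, dim)  # H(x): learned transform
        self.gate = nn.Linear(dim, dim)       # produces the carry gate g

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), e.g. flattened space-time patch embeddings
        g = torch.sigmoid(self.gate(x))       # gate in (0, 1) per channel
        h = torch.relu(self.transform(x))     # transformed representation
        return g * h + (1.0 - g) * x          # gated blend of transform and identity


# Minimal usage example on dummy video-patch tokens
if __name__ == "__main__":
    tokens = torch.randn(2, 196, 768)         # (batch, patches, embed dim)
    mixer = HighwayTokenMixer(768)
    out = mixer(tokens)
    print(out.shape)                          # torch.Size([2, 196, 768])
```

Note that, unlike self-attention, this sketch mixes no information across tokens; the paper's "improved" variant would presumably add some cross-token pathway, which the abstract leaves unspecified.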