With the increasing coverage of video surveillance systems in modern society, demand for using artificial intelligence algorithm to replace humans in violent behavior recognition has also become stronger. By moving some channels in the temporal dimension, temporary shift module (TSM) can achieve the performance of three-dimensional convolution neural network (CNN) with the complexity of two-dimensional CNN, and extract the temporal and spatial information at the same time. Our intuition is that too many temporary shift modules may fuse too much action information in each frame, which weakens the capability of CNN on spatiotemporal information extraction. To verify the aforementioned conjecture, we adjusted the network structure based on TSM, proposed partial TSM, selected the optimal model through experiments, and verified the performance of the algorithm on multiple datasets and our expanded datasets. The proposed optimal model not only reduced the memory usage of hardware but also achieved higher accuracy on multiple datasets with 77.3% running time. Meanwhile, we achieved state-of-the-art performance of 91% on RWF-2000 dataset. |
CITATIONS
Cited by 2 scholarly publications.
Convolution
Video
Video surveillance
Convolutional neural networks
Data modeling
Performance modeling
Detection and tracking algorithms