Recently, vision transformers have shown impressive results, significantly outperforming large convolution-based models. We propose a gating network for salient object detection that uses the Pyramid Vision Transformer as its backbone, learning global and local representations through self-attention. Multi-level gating units recover finer details in the saliency map by establishing cooperation among features at different levels, improving the discriminability of the whole network; with their help, valuable context information from the encoder is optimally transmitted to the decoder. A pyramid pooling module collects high-level semantic information, and a feature aggregation decoder integrates and decodes the semantic information from every level. Experimental results on five challenging benchmark databases demonstrate that the proposed method performs more favorably than current state-of-the-art methods on four evaluation criteria.
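The abstract does not specify how a gating unit is implemented; as a minimal sketch, one common interpretation is a sigmoid-weighted skip connection that controls how much encoder context flows into the decoder at each level. The PyTorch module below illustrates this idea only; the class and parameter names (GatingUnit, enc_channels, dec_channels) and the gating formulation are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GatingUnit(nn.Module):
    """Hypothetical multi-level gating unit: a sigmoid gate decides how much
    encoder context is passed to the decoder at one feature level."""

    def __init__(self, enc_channels: int, dec_channels: int):
        super().__init__()
        # Project encoder features to the decoder's channel width.
        self.proj = nn.Conv2d(enc_channels, dec_channels, kernel_size=1)
        # Compute a per-pixel gate from the concatenated encoder/decoder features.
        self.gate = nn.Sequential(
            nn.Conv2d(dec_channels * 2, dec_channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, enc_feat: torch.Tensor, dec_feat: torch.Tensor) -> torch.Tensor:
        enc = self.proj(enc_feat)
        g = self.gate(torch.cat([enc, dec_feat], dim=1))
        # Gated skip connection: only the encoder context selected by the gate
        # is added to the decoder features.
        return dec_feat + g * enc

# Usage sketch: gate one encoder level (e.g., a PVT stage output, assumed
# 320 channels here) into a 64-channel decoder stream.
unit = GatingUnit(enc_channels=320, dec_channels=64)
enc_feat = torch.randn(1, 320, 22, 22)
dec_feat = torch.randn(1, 64, 22, 22)
out = unit(enc_feat, dec_feat)  # -> torch.Size([1, 64, 22, 22])
```

One such unit per encoder level would realize the cooperation among feature levels that the abstract describes, with the gates learned end to end alongside the backbone.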