Object detection is a challenging computer-vision task that requires predicting both the class and the location of each object in an image. Most existing methods rely on convolutional neural networks together with hand-crafted components such as anchor boxes and non-maximum suppression. Recently, DETR, a novel end-to-end approach, was proposed; it uses a transformer encoder-decoder to cast object detection as a set prediction problem. However, DETR suffers from poor performance on small objects and slow convergence. In this paper, we propose FF-DETR, a feature-fusion detection transformer that improves both the accuracy and the convergence speed of DETR-like models. FF-DETR introduces three feature-fusion modules: (1) Contour Fusion FPN, which fuses multi-scale features using self-attention and deformable convolution; (2) Position-Content Query Fusion, which initializes the content queries by fusing the position queries with the encoder output features; and (3) Global Decoder Layer Fusion, which fuses the outputs of all decoder layers and iteratively updates the position queries. Experiments on the COCO dataset show that FF-DETR outperforms DETR and its variants in both accuracy and efficiency.
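The Position-Content Query Fusion idea described above can be sketched in plain Python: instead of starting the decoder's content queries from zeros (as in DETR), they are initialized by fusing the position queries with a summary of the encoder output. The shapes, the mean-pooling step, and the fusion weight `w_fuse` are all assumptions for illustration; the paper's actual module may differ.

```python
def matmul(a, b):
    """Multiply an (n x k) list-of-lists matrix by a (k x m) one."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mean_pool(feats):
    """Average the encoder output tokens into one global feature vector."""
    d = len(feats[0])
    return [sum(f[j] for f in feats) / len(feats) for j in range(d)]

def fuse_queries(pos_queries, enc_feats, w_fuse):
    """Initialize content queries from position queries and encoder features.

    pos_queries: (num_queries x d) position query embeddings
    enc_feats:   (num_tokens x d) encoder output features
    w_fuse:      (2d x d) fusion projection (hypothetical learned weight)
    """
    g = mean_pool(enc_feats)
    # Concatenate each position query with the pooled encoder feature,
    # then project back down to dimension d.
    fused = [q + g for q in pos_queries]
    return matmul(fused, w_fuse)

# Toy example: 2 queries, 3 encoder tokens, d = 2.
pos = [[1.0, 0.0], [0.0, 1.0]]
enc = [[2.0, 2.0], [4.0, 0.0], [0.0, 4.0]]
w = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.0], [0.0, 0.5]]  # (2d x d)
content = fuse_queries(pos, enc, w)
```

The same fusion pattern (pool, concatenate, project) is the standard way to condition one set of embeddings on another; here it simply replaces the zero initialization of DETR's content queries.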