16 August 2024 Distance-aware multilayer aggregation transformer for image captioning
Honglong Xu, Lixia Xue, Ronggui Wang, Juan Yang
Proceedings Volume 13230, Third International Conference on Machine Vision, Automatic Identification, and Detection (MVAID 2024); 132302G (2024) https://doi.org/10.1117/12.3035618
Event: Third International Conference on Machine Vision, Automatic Identification and Detection, 2024, Kunming, China
Abstract
Recently, Transformers based on grid features have achieved great success in image captioning, but two problems remain: the flattening operation of the Transformer destroys the positional information among visual objects, and only the output of the last encoder layer is sent to the decoder, which loses low-level semantic information. To solve these problems, we first introduce Distance-aware self-attention (DA), which considers the original geometric distance between visual objects in the two-dimensional image during self-attention modeling and integrates this distance information into the attention calculation through a mapping function, better capturing the relational information among visual objects. Second, we propose the Multilayer Aggregation (MA) module, which aggregates the outputs of the encoder layers and establishes a weighted residual connection to form the final output, which is sent to the decoder separately. By aggregating information from different encoder layers, it achieves cross-layer semantic complementarity, so that semantically rich features can be explored simultaneously from both low-level and high-level encoding layers. To verify the validity of the two proposed designs, we applied them to a standard Transformer and conducted extensive experiments on MS-COCO, a benchmark dataset for image captioning. The experimental results demonstrate the effectiveness of our proposed Distance-aware Multilayer Aggregation Transformer (DMAT) model.
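The two ideas in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the abstract does not specify the mapping function or the aggregation weights, so the linear penalty `-alpha * d`, the fixed `layer_weights`, the `residual_weight`, and all function names below are assumptions chosen only for illustration.

```python
import math


def grid_distances(h, w):
    """Pairwise Euclidean distances between the cells of an h x w grid,
    indexed in the row-major order produced by flattening."""
    coords = [(r, c) for r in range(h) for c in range(w)]
    return [[math.hypot(r1 - r2, c1 - c2) for (r2, c2) in coords]
            for (r1, c1) in coords]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def distance_aware_attention(q, k, v, dist, alpha=1.0):
    """Single-head scaled dot-product attention with a distance term added
    to the attention logits.

    ASSUMPTION: the mapping function is f(d) = -alpha * d, so distant grid
    cells receive lower attention. The paper's actual mapping is not given
    in the abstract.
    """
    d_k = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k)
            - alpha * dist[i][j]
            for j, kj in enumerate(k)
        ]
        weights = softmax(scores)
        out.append([sum(wt * vj[c] for wt, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


def aggregate_layers(layer_outputs, layer_weights, residual_weight=0.5):
    """Combine the outputs of all encoder layers.

    ASSUMPTION: the aggregate is a weighted sum over layers, and the final
    output is last_layer + residual_weight * aggregate. The abstract only
    states that a weighted residual connection is used.
    """
    last = layer_outputs[-1]
    n_tokens, dim = len(last), len(last[0])
    return [
        [
            last[t][c] + residual_weight * sum(
                wt * layer[t][c]
                for wt, layer in zip(layer_weights, layer_outputs)
            )
            for c in range(dim)
        ]
        for t in range(n_tokens)
    ]


if __name__ == "__main__":
    dist = grid_distances(2, 2)            # 4 grid cells after flattening
    x = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.4], [0.2, 0.2]]
    layer1 = distance_aware_attention(x, x, x, dist)
    layer2 = distance_aware_attention(layer1, layer1, layer1, dist)
    print(aggregate_layers([layer1, layer2], [0.5, 0.5]))
```

In this sketch the distance matrix is computed from the original grid coordinates before flattening, so the attention logits retain the two-dimensional layout that flattening would otherwise discard. In a trained model the per-layer weights would presumably be learned rather than fixed.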
(2024) Published by SPIE. Downloading of the abstract is permitted for personal use only.
Honglong Xu, Lixia Xue, Ronggui Wang, and Juan Yang "Distance-aware multilayer aggregation transformer for image captioning", Proc. SPIE 13230, Third International Conference on Machine Vision, Automatic Identification, and Detection (MVAID 2024), 132302G (16 August 2024); https://doi.org/10.1117/12.3035618
KEYWORDS
Transformers
Visualization
Semantics
Performance modeling
Matrices
Information visualization
Education and training
