With the remarkable success of image captioning, visual attention has become a vital component of captioning models. However, most attention-based image captioning methods do not consider the relationships among image regions, which play a significant role in image understanding. We propose an image captioning method based on a local relation network that uses a multilevel attention approach with a graph neural network. It not only fully explores the relationships between objects and image regions but also generates salient, context-aware features for every region in the image. The attention employed in our work enhances the image representation capability of the method by focusing on a given image region together with its related regions. Jointly modeling the relevant contextual information, spatial locations, and deep visual features in this way leads to improved caption generation. We verified the effectiveness of the proposed model through extensive experiments on three benchmark datasets: Flickr30k, MSCOCO, and nocaps. The results show the superiority of the proposed method over existing methods, both quantitatively and qualitatively. Detailed ablation studies are conducted to show how each component contributes to the final performance.
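The abstract does not include implementation details, but the idea of attending over a region and its related regions can be illustrated with a minimal sketch. The snippet below assumes a standard masked-attention formulation over Faster R-CNN-style region features; the module name `LocalRelationAttention`, the feature dimension of 2048, and the adjacency-mask construction are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalRelationAttention(nn.Module):
    """Sketch of graph-style attention over image region features:
    each region attends only to its related regions (given by an
    adjacency mask, e.g. derived from spatial overlap), producing
    context-aware region features. Illustrative only."""

    def __init__(self, feat_dim: int = 2048, hidden_dim: int = 512):
        super().__init__()
        self.query = nn.Linear(feat_dim, hidden_dim)
        self.key = nn.Linear(feat_dim, hidden_dim)
        self.value = nn.Linear(feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, feat_dim)

    def forward(self, regions: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # regions: (batch, num_regions, feat_dim) pooled CNN features per region
        # adj:     (batch, num_regions, num_regions), 1 where two regions are related
        q, k, v = self.query(regions), self.key(regions), self.value(regions)
        scores = torch.matmul(q, k.transpose(-1, -2)) / k.size(-1) ** 0.5
        # Mask out unrelated regions so attention stays local to the relation graph.
        scores = scores.masked_fill(adj == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        context = torch.matmul(weights, v)       # aggregate features of related regions
        return regions + self.out(context)       # residual: original + relational context


if __name__ == "__main__":
    # Toy usage: 36 region features per image, fully connected relation graph.
    feats = torch.randn(2, 36, 2048)
    adj = torch.ones(2, 36, 36)
    enhanced = LocalRelationAttention()(feats, adj)
    print(enhanced.shape)  # torch.Size([2, 36, 2048])
```

In a captioning pipeline, the enhanced region features would then feed a decoder (e.g., an LSTM or Transformer) whose word-level attention operates over these relation-aware representations rather than raw region features.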
Keywords: visualization, image enhancement, neural networks, data modeling, image understanding, visual process modeling, performance modeling