Automatic Target Recognition (ATR) often confronts intricate visual scenes, requiring models capable of discerning subtle distinctions. Real-world datasets such as the Defense Systems Information Analysis Center (DSIAC) ATR database are unimodal, which limits performance, and lack contextual information for each frame. To address these limitations, we enrich the DSIAC dataset with algorithmically generated captions and propose new train/test splits, creating a rich multimodal training landscape. To leverage these captions effectively, we explore the integration of a vision-language model, specifically Contrastive Language-Image Pre-training (CLIP), which combines visual perception with linguistic descriptors. At the core of our methodology lies a homotopy-based multi-objective optimization technique designed to balance model precision, generalizability, and interpretability. Our framework, built with PyTorch Lightning and Ray Tune for distributed hyperparameter optimization, tunes models to meet the demands of practical ATR applications. All code and data are available at https://github.com/sabraha2/ATR-CLIP-Multi-Objective-Homotopy-Optimization.
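The abstract does not detail the optimization in code; the sketch below is a minimal Python illustration of the general idea of homotopy-based scalarization, assuming a convex (linear) homotopy between a CLIP-style contrastive objective and a placeholder secondary objective. The functions `homotopy_weight`, `clip_contrastive_loss`, and `homotopy_multiobjective_loss`, the linear schedule, and the stand-in regularizer are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def homotopy_weight(step: int, total_steps: int) -> float:
    """Illustrative linear homotopy parameter t in [0, 1] that shifts
    emphasis between objectives as training progresses."""
    return min(1.0, step / max(1, total_steps))


def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss of the kind used by CLIP-style models."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def homotopy_multiobjective_loss(primary: torch.Tensor,
                                 secondary: torch.Tensor,
                                 t: float) -> torch.Tensor:
    """Convex homotopy between two objectives:
    L(t) = (1 - t) * primary + t * secondary."""
    return (1.0 - t) * primary + t * secondary


if __name__ == "__main__":
    # Toy usage: blend a CLIP contrastive term with a hypothetical
    # regularization term standing in for a secondary objective.
    batch, dim, total_steps = 8, 512, 1000
    image_emb = torch.randn(batch, dim, requires_grad=True)
    text_emb = torch.randn(batch, dim, requires_grad=True)

    for step in (0, 500, 1000):
        t = homotopy_weight(step, total_steps)
        primary = clip_contrastive_loss(image_emb, text_emb)
        secondary = image_emb.pow(2).mean()  # placeholder objective
        loss = homotopy_multiobjective_loss(primary, secondary, t)
        print(f"step={step:4d}  t={t:.2f}  loss={loss.item():.4f}")
```

In a full training setup, the homotopy parameter would typically be advanced per optimizer step inside a PyTorch Lightning module, with Ray Tune searching over schedule and weighting hyperparameters.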