Title Multispectral image caption unification using diffusion and cycle GAN models
Authors Komurcu, Kursat ; Petkevičius, Linas
DOI 10.1109/ACCESS.2025.3632152
Is Part of IEEE Access. Piscataway, NJ : Institute of Electrical and Electronics Engineers (IEEE). 2025, vol. 13, p. 193708-193718. eISSN 2169-3536
Abstract [eng] A major limitation is the scarcity of geospatial datasets that simultaneously provide multispectral imagery and descriptive captions. In particular, datasets containing aligned RGB, multispectral, and caption information remain highly limited. Therefore, we propose a full-circle pipeline to unify triplets of RGB images, image captions, and Sentinel-2–like multispectral data. To accomplish this, we combine a fine-tuned Stable Diffusion model with a Cycle GAN trained on generated images and the EuroSAT dataset. First, we use Qwen2-VL-2B in a zero-shot setting to generate captions for 675,993 images from the SkyScript dataset. We then fine-tune the Stable Diffusion 2-1 Base model on these image–caption pairs and generate 123,081 RGB images conditioned on randomly selected Qwen2-VL-2B captions. Finally, we train a Cycle GAN on roughly 27,000 paired RGB and multispectral images and use it to translate the synthetic RGB images into multispectral counterparts. In this way, textual prompts produce synthetic satellite imagery that can be converted to Sentinel-2–like multispectral data. The pipeline makes it possible to unify datasets that contain only captions or only RGB images by producing complete triplets (caption, RGB, multispectral). Quantitative evaluations support the credibility of the approach: the generated captions achieve a SkyCLIP Score of 0.7312, the fine-tuned Stable Diffusion model achieves a CMMD of 0.245, and the Cycle GAN multispectral outputs reach a spectral angle mapper (SAM) value of 10.16° on our synthetic dataset versus 13.94° on EuroSAT. Links to the code, models, and dataset are available on GitHub and Hugging Face.
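The SAM values quoted in the abstract are angles between spectra, so a smaller value means the generated bands point in nearly the same direction as the reference bands, regardless of overall brightness. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code: the function names are hypothetical, and averaging the per-pixel angle over all pixel pairs is an assumption, since the abstract does not state the exact protocol or band selection.

```python
import math

def spectral_angle_deg(ref, est):
    """Angle in degrees between two spectra (equal-length sequences of band values)."""
    if len(ref) != len(est):
        raise ValueError("spectra must have the same number of bands")
    dot = sum(r * e for r, e in zip(ref, est))
    norm = math.sqrt(sum(r * r for r in ref)) * math.sqrt(sum(e * e for e in est))
    if norm == 0.0:
        raise ValueError("spectral angle is undefined for an all-zero spectrum")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos))

def mean_sam_deg(ref_pixels, est_pixels):
    """Mean spectral angle over paired pixels (assumed aggregation)."""
    if len(ref_pixels) != len(est_pixels) or not ref_pixels:
        raise ValueError("need the same non-zero number of pixels in both inputs")
    angles = [spectral_angle_deg(r, e) for r, e in zip(ref_pixels, est_pixels)]
    return sum(angles) / len(angles)
```

Because the angle ignores vector length, a spectrum and a uniformly brighter copy of it score close to 0°, while spectra with no overlapping band energy score 90°. For full images, an array library would normally replace the Python loops.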
Published Piscataway, NJ : Institute of Electrical and Electronics Engineers (IEEE)
Type Journal article
Language English
Publication date 2025