C-DiffSET: Leveraging Latent Diffusion for SAR-to-EO Image Translation with Confidence-Guided Reliable Object Generation
Official PyTorch implementation of the paper "C-DiffSET: Leveraging Latent Diffusion for SAR-to-EO Image Translation with Confidence-Guided Reliable Object Generation", accepted to IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2026.
Jeonghyeok Do1, Jaehyup Lee2, Seungchul Lee3, and Munchurl Kim1,†
1 School of Electrical Engineering, KAIST, Daejeon, Korea
2 School of Computer Science and Engineering, Kyungpook National University, Daegu, Korea
3 Stellarvision Inc., Daejeon, Korea
† Corresponding author
- Jul 1, 2026: Code and pretrained models are released. 🎉
- Jun 1, 2026: C-DiffSET is accepted to IEEE TCSVT. 🎉
- Dec 9, 2024: This repository is created.
SAR imagery works in all weather and lighting conditions but is hard to interpret because of speckle noise and its unusual structure. C-DiffSET translates SAR into EO (electro-optical) imagery by fine-tuning a pretrained Latent Diffusion Model (Stable Diffusion v2.1) rather than training from scratch, which mitigates overfitting on the small paired SAR–EO datasets.
Two ideas make this work:
- Shared latent space via a frozen VAE. SAR and EO images are embedded by the same pretrained VAE encoder. The SAR latent is concatenated (channel-wise) with the noisy EO latent and used to condition the denoising U-Net, which preserves pixel-wise correspondence.
- Confidence-guided diffusion (C-Diff) loss. The U-Net additionally predicts a per-pixel variance. A β-NLL–style loss down-weights uncertain regions, so temporal discrepancies (objects that appear/disappear between the SAR and EO acquisitions) are not aggressively hallucinated.
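For reference, the β-NLL objective of Seitzer et al., on which the C-Diff loss is modeled, can be written for a predicted mean $\hat{\mu}$, a predicted variance $\hat{\sigma}^{2}$, and a target $y$ as shown below. Reading $\hat{\mu}$ as the predicted noise, $y$ as the true noise, and $\hat{\sigma}^{2}$ as the per-pixel variance output is an assumption made here for illustration; the exact C-Diff formulation, including how the confidence map is obtained from the variance, is given in the paper.

$$
\mathcal{L}_{\beta\text{-NLL}} = \mathbb{E}\left[ \operatorname{sg}\!\left[\hat{\sigma}^{2\beta}\right] \left( \frac{1}{2}\log \hat{\sigma}^{2} + \frac{(y - \hat{\mu})^{2}}{2\hat{\sigma}^{2}} \right) \right]
$$

Here $\operatorname{sg}[\cdot]$ is the stop-gradient operator. With $\beta = 0$ the expression reduces to the Gaussian negative log-likelihood; with $\beta = 1$ the gradient with respect to $\hat{\mu}$ matches that of a squared-error loss up to a constant factor, which keeps high-variance pixels from being ignored entirely.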
C-DiffSET reaches state-of-the-art results on QXS-SAROPT, SAR2Opt, SpaceNet6, and a proprietary Stellar-Vision dataset.
Overview of the proposed C-DiffSET framework. (a) Training: a frozen VAE encodes the SAR image X and EO image Y into a shared latent space; noise is added to the EO latent in the forward process, and the denoising U-Net predicts both the noise ε̂t and a confidence map ĉt conditioned on the SAR latent zx, optimized by the C-Diff loss. (b) Inference: an EO image Ŷ is generated from a SAR input by iteratively denoising a random latent ẑyT.
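In the standard DDPM notation, the training step in (a) can be summarized as follows. The symbols $\bar{\alpha}_t$ (cumulative noise schedule), $z_y$ (clean EO latent), $z_{y,t}$ (noisy EO latent), and $U_\theta$ (the denoising U-Net) are notation assumed here for illustration and may differ from the paper's; the fixed text prompt is omitted for brevity.

$$
\begin{aligned}
z_{y,t} &= \sqrt{\bar{\alpha}_t}\, z_{y} + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \\
(\hat{\epsilon}_t,\ \hat{c}_t) &= U_\theta\big([\,z_x \,;\, z_{y,t}\,],\ t\big)
\end{aligned}
$$

The bracket $[\,\cdot\,;\,\cdot\,]$ denotes channel-wise concatenation of the SAR latent and the noisy EO latent.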
The pretrained SD 2.1 U-Net is modified in two places (train.py):
| Layer | Original | Modified | Purpose |
|---|---|---|---|
| conv_in | 4 ch | 8 ch | channel-wise concat of SAR latent and noisy EO latent; original weights duplicated and scaled by 0.5 |
| conv_out | 4 ch | 5 ch | 4 channels for predicted noise + 1 channel for raw variance (zero-initialized) |
The VAE and CLIP text encoder are frozen; only the U-Net is fine-tuned. Following the paper, a fixed prompt ("electro-optical image") is used as a semantic anchor at both training and inference time, instead of a null/empty prompt.
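The two layer changes listed above can be sketched with plain `torch.nn.Conv2d` layers. This is a minimal illustration, not the code in train.py: the helper names `widen_conv_in` and `add_variance_channel` are hypothetical, and the layer shapes in the usage note (320 hidden channels, 3×3 kernels) are assumed from the usual Stable Diffusion U-Net layout.

```python
import torch
import torch.nn as nn


def widen_conv_in(conv: nn.Conv2d) -> nn.Conv2d:
    """Double the input channels (hypothetical helper).

    The pretrained kernel is duplicated along the input-channel axis and
    scaled by 0.5, so feeding the same latent twice reproduces the
    original layer's output.
    """
    new = nn.Conv2d(2 * conv.in_channels, conv.out_channels,
                    kernel_size=conv.kernel_size, stride=conv.stride,
                    padding=conv.padding)
    with torch.no_grad():
        new.weight.copy_(torch.cat([conv.weight, conv.weight], dim=1) * 0.5)
        new.bias.copy_(conv.bias)
    return new


def add_variance_channel(conv: nn.Conv2d) -> nn.Conv2d:
    """Append one zero-initialized output channel (hypothetical helper).

    The first `conv.out_channels` outputs keep the pretrained weights;
    the extra channel starts at exactly zero.
    """
    new = nn.Conv2d(conv.in_channels, conv.out_channels + 1,
                    kernel_size=conv.kernel_size, stride=conv.stride,
                    padding=conv.padding)
    with torch.no_grad():
        new.weight.zero_()
        new.bias.zero_()
        new.weight[:conv.out_channels].copy_(conv.weight)
        new.bias[:conv.out_channels].copy_(conv.bias)
    return new
```

Applied to a `Conv2d(4, 320, 3, padding=1)` input layer and a `Conv2d(320, 4, 3, padding=1)` output layer, this yields 8-in and 5-out layers whose initial behavior on the original channels is unchanged, which is the usual motivation for this kind of initialization.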
git clone https://github.com/KAIST-VICLab/C-DiffSET.git
cd C-DiffSET
conda create -n cdiffset python=3.10 -y
conda activate cdiffset
# Install PyTorch matching your CUDA version first (see https://pytorch.org),
# then the remaining dependencies:
pip install -r requirements.txt
Please install the torch/torchvision build that matches your CUDA setup. The other packages are version-flexible; diffusers must be recent enough to expose register_to_config.
Base model. The pretrained backbone is loaded from Manojb/stable-diffusion-2-1-base on the Hugging Face Hub (the original stabilityai/stable-diffusion-2-1-base repository is no longer available). It is downloaded automatically on first run. To use a different backbone, override pretrained_model_name_or_path in the config or via --pretrained-model-name-or-path.
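One common way to let a value come from either the config or the command line is to load the config values as argparse defaults and let explicit flags win. The sketch below illustrates that pattern only; it is not the code in main.py, the function name `resolve_args` is hypothetical, and a plain dict stands in for the parsed YAML file.

```python
import argparse


def resolve_args(argv, config_values):
    """Merge config values and CLI flags; CLI flags take precedence.

    `config_values` is a dict standing in for the loaded YAML config
    (assumption for this sketch).
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default=None)
    parser.add_argument("--pretrained-model-name-or-path",
                        default="Manojb/stable-diffusion-2-1-base")
    # Config values replace the built-in defaults...
    parser.set_defaults(**config_values)
    # ...and anything passed explicitly on the command line replaces both.
    return parser.parse_args(argv)
```

With this pattern, an empty command line picks up the config value, while passing `--pretrained-model-name-or-path` explicitly overrides it.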
| Dataset | Pol. | GSD | Link |
|---|---|---|---|
| QXS-SAROPT | single | 1 m | https://github.com/yaoxu008/QXS-SAROPT |
| SAR2Opt | single | 1 m | https://github.com/MarsZhaoYT/SAR2Opt-Heterogeneous-Dataset |
| SpaceNet6 | full | 0.5 m | https://spacenet.ai/sn6-challenge/ |
| Stellar-Vision | single | 0.5–1.2 m | proprietary (not publicly released) |
data/data_split.py writes train_eo_list*.txt/.pkl and test_eo_list*.txt/.pkl under ./data/<Dataset>_split/, which is exactly where the configs look for them:
python data/data_split.py --dataset spacenet --dataroot /path/to/SpaceNet6 --ratio 80
python data/data_split.py --dataset saropt --dataroot /path/to/sar2opt
python data/data_split.py --dataset qxs --dataroot /path/to/QXSLAB_SAROPT --ratio 80
The feeders derive the SAR path from the EO path automatically (e.g. PS-RGB → SAR-Intensity for SpaceNet6, trainB → trainA for SAR2Opt), so only the EO list is stored.
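The EO-to-SAR path mapping can be pictured as a simple token substitution. The sketch below is illustrative only and is not the code in feeders/feeder.py: the function name `sar_path_from_eo` is hypothetical, and it assumes the EO token appears verbatim in the path and that every occurrence should be replaced.

```python
def sar_path_from_eo(eo_path: str, dataset: str) -> str:
    """Map an EO file path to its paired SAR path (hypothetical helper)."""
    # Token pairs taken from the examples in this README.
    rules = {
        "spacenet": ("PS-RGB", "SAR-Intensity"),
        "saropt": ("trainB", "trainA"),
    }
    eo_token, sar_token = rules[dataset]
    if eo_token not in eo_path:
        raise ValueError(f"'{eo_token}' not found in EO path: {eo_path}")
    # Replace every occurrence, since the token may appear in both the
    # directory name and the file name.
    return eo_path.replace(eo_token, sar_token)
```

Storing only the EO list keeps the split files small and avoids the two lists drifting out of sync.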
Edit the dataroot and accelerator_path fields in configs/*.yaml (they currently contain /path/to/... placeholders).
accelerator_path in each config should point to a SAR-conditioned U-Net checkpoint (8-in / 4-out, i.e. C-DiffSET without the confidence channel). Training then adds the 5th variance channel on top of it.
# SpaceNet6
python main.py --config configs/spacenet_eps_conf.yaml
# SAR2Opt
python main.py --config configs/saropt_eps_conf.yaml
# QXS-SAROPT
python main.py --config configs/qxs_eps_conf.yaml
Any argument in main.py can be overridden either through the YAML file or on the command line. Logs, checkpoints, and validation result grids are written to work_dir. Default settings follow the paper: AdamW, lr = 3e-5, weight decay 0.01, cosine schedule, 50,000 iterations, 100-step warmup.
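To make the default schedule concrete, the sketch below computes the learning rate at a given iteration for a linear warmup followed by cosine decay. The peak rate, warmup length, and iteration count come from the defaults above; the linear warmup shape and the decay to zero are assumptions, and `lr_at` is a hypothetical name, not a function from this repository.

```python
import math

BASE_LR = 3e-5        # peak learning rate
WARMUP_STEPS = 100    # warmup length
TOTAL_STEPS = 50_000  # total training iterations


def lr_at(step: int) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    # Fraction of the post-warmup schedule completed, clamped to [0, 1].
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate is half the peak at iteration 50, reaches 3e-5 at iteration 100, and falls to zero at iteration 50,000.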
Translate a folder of SAR .png images with a fine-tuned checkpoint:
python test.py \
--sar-dir /path/to/test/SAR \
--output-dir ./results/eo \
--conf-dir ./results/confidence \
--checkpoint /path/to/model.safetensors \
--num-inference-steps 50
--conf-dir is optional; when set, the confidence map captured at the midpoint denoising step (t = T/2) is min-max normalized and saved alongside each EO output. Inference uses a DDIM scheduler; 50 steps is the quality/speed sweet spot reported in the paper (~3.4 s per 512×512 image on an A6000).
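Min-max normalization of a confidence map before saving it as an image can be sketched as follows. This is a generic NumPy illustration rather than the code in test.py: the function name `confidence_to_uint8` and the `eps` guard against a constant map are assumptions.

```python
import numpy as np


def confidence_to_uint8(conf_map, eps: float = 1e-8) -> np.ndarray:
    """Rescale a 2-D confidence map to [0, 255] for saving as a grayscale image."""
    conf = np.asarray(conf_map, dtype=np.float64)
    lo, hi = conf.min(), conf.max()
    # eps avoids division by zero when the map is constant.
    normalized = (conf - lo) / (hi - lo + eps)
    return np.round(normalized * 255.0).astype(np.uint8)
```

Because the scaling is per image, the saved maps show relative confidence within one output and are not directly comparable across images.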
utils.py implements the reported metrics via torchmetrics, lpips, and pytorch-fid: FID, LPIPS, SCC, SSIM, PSNR (and Inception Score for large batches). Validation during training reports PSNR and keeps the best checkpoint.
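As a reminder of what the validation metric measures, PSNR is 10·log10(MAX² / MSE), where MAX is the dynamic range of the image. The sketch below is a plain NumPy stand-in for illustration, not the torchmetrics-based implementation in utils.py; the function name `psnr` and the default `data_range` of 1.0 are assumptions.

```python
import numpy as np


def psnr(pred, target, data_range: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))
```

For images in [0, 1], a uniform error of 0.1 gives an MSE of 0.01 and therefore 20 dB, which gives a sense of scale for the PSNR values in the tables.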
Each cell reports SAR2Opt / SpaceNet6.

| Type | Method | FID↓ | LPIPS↓ | SSIM↑ | PSNR↑ |
|---|---|---|---|---|---|
| GAN | Pix2Pix | 196.87 / 124.55 | 0.426 / 0.256 | 0.216 / 0.522 | 15.42 / 19.36 |
| GAN | CycleGAN | 139.72 / 114.81 | 0.425 / 0.274 | 0.224 / 0.493 | 14.93 / 17.80 |
| GAN | StegoGAN | 144.54 / 75.12 | 0.398 / 0.244 | 0.237 / 0.516 | 15.62 / 18.96 |
| GAN | MT-GAN | 135.42 / 72.35 | 0.385 / 0.238 | 0.252 / 0.528 | 15.82 / 19.15 |
| DDPM | SF-Diff | 91.05 / 66.12 | 0.416 / 0.258 | 0.272 / 0.249 | 16.35 / 18.85 |
| LDM | BBDM | 94.72 / 81.86 | 0.473 / 0.302 | 0.234 / 0.217 | 15.13 / 17.68 |
| LDM | cBBDM | 97.64 / 72.77 | 0.394 / 0.243 | 0.285 / 0.254 | 16.59 / 19.03 |
| LDM | C-DiffSET (Ours) | 77.81 / 37.44 | 0.346 / 0.142 | 0.286 / 0.567 | 16.61 / 21.02 |
| Method | FID↓ | LPIPS↓ | SCC↑ | SSIM↑ | PSNR↑ |
|---|---|---|---|---|---|
| MT-GAN | 73.15 | 0.378 | 0.0021 | 0.295 | 15.98 |
| SF-Diff | 64.18 | 0.432 | 0.0021 | 0.299 | 16.11 |
| cBBDM | 69.47 | 0.420 | 0.0023 | 0.304 | 16.25 |
| C-DiffSET (Ours) | 18.15 | 0.293 | 0.0108 | 0.372 | 18.08 |
See the paper and the project page for the full comparison, the Stellar-Vision geo-disjoint evaluation, and all ablations (β, channel-mapping, text prompt, backbone).
C-DiffSET/
├── main.py # entry point (YAML + argparse)
├── train.py # Trainer: model setup, C-Diff loss, train/val loops
├── test.py # standalone inference
├── utils.py # metrics + logging/reporting
├── feeders/
│ ├── __init__.py
│ └── feeder.py # dataset classes (QXS, SAR2Opt, SpaceNet6, SEN1-2, Stellar)
├── data/
│ └── data_split.py # train/test split generator
├── configs/
│ ├── qxs_eps_conf.yaml
│ ├── saropt_eps_conf.yaml
│ └── spacenet_eps_conf.yaml
├── assets/ # figures for the README
├── requirements.txt
├── LICENSE
└── README.md
If you find this work useful, please cite:
@article{do2026cdiffset,
title = {C-DiffSET: Leveraging Latent Diffusion for SAR-to-EO Image Translation with Confidence-Guided Reliable Object Generation},
author = {Do, Jeonghyeok and Lee, Jaehyup and Lee, Seungchul and Kim, Munchurl},
journal = {IEEE Transactions on Circuits and Systems for Video Technology},
year = {2026},
doi = {10.1109/TCSVT.2026.3701447}
}
This work builds on Stable Diffusion and the Hugging Face Diffusers library. The C-Diff loss is inspired by the β-NLL formulation of Seitzer et al. We thank the authors of the QXS-SAROPT, SAR2Opt, and SpaceNet6 datasets.
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean Government (MSIT) (RS-2024-00338513).
Released under the MIT License. Note that the Stellar-Vision dataset is proprietary and is not included in this release.