C-DiffSET: Leveraging Latent Diffusion for SAR-to-EO Image Translation with Confidence-Guided Reliable Object Generation

Official PyTorch implementation of the paper "C-DiffSET: Leveraging Latent Diffusion for SAR-to-EO Image Translation with Confidence-Guided Reliable Object Generation", accepted to IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2026.

Jeonghyeok Do¹, Jaehyup Lee², Seungchul Lee³, and Munchurl Kim¹,†

¹ School of Electrical Engineering, KAIST, Daejeon, Korea
² School of Computer Science and Engineering, Kyungpook National University, Daegu, Korea
³ Stellarvision Inc., Daejeon, Korea

† Corresponding author



📰 News

  • Jul 1, 2026: Code and pretrained models are released. 🎉
  • Jun 1, 2026: C-DiffSET is accepted to IEEE TCSVT. 🎉
  • Dec 9, 2024: This repository is created.

Overview

SAR (synthetic aperture radar) imagery works in all weather and lighting conditions but is hard to interpret because of speckle noise and its unfamiliar image structure. C-DiffSET translates SAR into EO (electro-optical) imagery by fine-tuning a pretrained Latent Diffusion Model (Stable Diffusion v2.1) rather than training from scratch, which mitigates overfitting on small paired SAR–EO datasets.

Two ideas make this work:

  1. Shared latent space via a frozen VAE. SAR and EO images are embedded by the same pretrained VAE encoder. The SAR latent is concatenated (channel-wise) with the noisy EO latent and used to condition the denoising U-Net, which preserves pixel-wise correspondence.
  2. Confidence-guided diffusion (C-Diff) loss. The U-Net additionally predicts a per-pixel variance. A β-NLL–style loss down-weights uncertain regions, so temporal discrepancies (objects that appear/disappear between the SAR and EO acquisitions) are not aggressively hallucinated.
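
For orientation, the generic β-NLL objective of Seitzer et al., applied to a noise-prediction target, can be written as below. This is only the general form that the C-Diff loss is styled after, not necessarily the paper's exact definition: the per-element variance symbol σ̂² and its relation to the confidence map ĉt are assumptions made for illustration.

```latex
\mathcal{L}_{\beta\text{-NLL}}
  = \mathbb{E}_{t,\,\epsilon}\left[
      \sum_{i}
      \operatorname{sg}\!\left[\hat{\sigma}_{t,i}^{2\beta}\right]
      \left(
        \frac{\left(\epsilon_{i} - \hat{\epsilon}_{t,i}\right)^{2}}{2\,\hat{\sigma}_{t,i}^{2}}
        + \frac{1}{2}\log \hat{\sigma}_{t,i}^{2}
      \right)
    \right]
```

Here i indexes latent elements and sg[·] is the stop-gradient operator. With β = 0 this is the plain Gaussian negative log-likelihood, where residuals in high-variance regions are divided by a large σ̂² and therefore contribute less. With β = 1 the gradient with respect to ε̂ matches that of a squared-error loss.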

C-DiffSET reaches state-of-the-art results on QXS-SAROPT, SAR2Opt, SpaceNet6, and a proprietary Stellar-Vision dataset.


Overview of the proposed C-DiffSET framework. (a) Training: a frozen VAE encodes the SAR image X and EO image Y into a shared latent space; noise is added to the EO latent in the forward process, and the denoising U-Net predicts both the noise ε̂t and a confidence map ĉt conditioned on the SAR latent zx, optimized by the C-Diff loss. (b) Inference: an EO image Ŷ is generated from a SAR input by iteratively denoising a random latent ẑyT.


Architecture in a nutshell

The pretrained SD 2.1 U-Net is modified in two places (train.py):

| Layer | Original | Modified | Purpose |
|---|---|---|---|
| conv_in | 4 ch | 8 ch | Concatenation of SAR latent and noisy EO latent; original weights duplicated and scaled by 0.5 |
| conv_out | 4 ch | 5 ch | 4 channels for predicted noise + 1 channel for raw variance (zero-initialized) |

The VAE and CLIP text encoder are frozen; only the U-Net is fine-tuned. Following the paper, a fixed prompt ("electro-optical image") is used as a semantic anchor at both training and inference time, instead of a null/empty prompt.
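
The two layer changes can be sketched in PyTorch as follows. This is a minimal illustration on stand-in `nn.Conv2d` layers, not the code in train.py: the helper names `expand_conv_in` and `extend_conv_out` are hypothetical, and the 320-channel width and 3×3 kernels are assumed to match the SD U-Net.

```python
import torch
import torch.nn as nn


def expand_conv_in(conv: nn.Conv2d) -> nn.Conv2d:
    """Double the input channels; duplicate the pretrained weights and scale by 0.5."""
    new = nn.Conv2d(conv.in_channels * 2, conv.out_channels,
                    kernel_size=conv.kernel_size, stride=conv.stride,
                    padding=conv.padding, bias=conv.bias is not None)
    with torch.no_grad():
        new.weight.copy_(torch.cat([conv.weight, conv.weight], dim=1) * 0.5)
        if conv.bias is not None:
            new.bias.copy_(conv.bias)
    return new


def extend_conv_out(conv: nn.Conv2d, extra: int = 1) -> nn.Conv2d:
    """Append zero-initialized output channels after the pretrained ones."""
    new = nn.Conv2d(conv.in_channels, conv.out_channels + extra,
                    kernel_size=conv.kernel_size, stride=conv.stride,
                    padding=conv.padding, bias=conv.bias is not None)
    with torch.no_grad():
        new.weight.zero_()
        new.weight[: conv.out_channels].copy_(conv.weight)
        if conv.bias is not None:
            new.bias.zero_()
            new.bias[: conv.out_channels].copy_(conv.bias)
    return new


# Stand-ins for the pretrained layers (assumed shapes: 4 -> 320 and 320 -> 4).
conv_in = nn.Conv2d(4, 320, kernel_size=3, padding=1)
conv_out = nn.Conv2d(320, 4, kernel_size=3, padding=1)

conv_in8 = expand_conv_in(conv_in)     # accepts [SAR latent, noisy EO latent]
conv_out5 = extend_conv_out(conv_out)  # 4 noise channels + 1 raw-variance channel

z = torch.randn(1, 4, 64, 64)          # a 4-channel latent
h = torch.randn(1, 320, 64, 64)        # a feature map entering conv_out
out = conv_out5(h)
noise_pred, raw_var = out[:, :4], out[:, 4:]
```

With this initialization, feeding the same latent into both halves of `conv_in8` reproduces the original `conv_in` output, and the new variance channel starts at exactly zero, so the modified network initially behaves like the pretrained one.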


Installation

git clone https://github.com/KAIST-VICLab/C-DiffSET.git
cd C-DiffSET

conda create -n cdiffset python=3.10 -y
conda activate cdiffset

# Install PyTorch matching your CUDA version first (see https://pytorch.org),
# then the remaining dependencies:
pip install -r requirements.txt

Please install the torch / torchvision build that matches your CUDA setup. The other packages are version-flexible; diffusers must be recent enough to expose register_to_config.

Base model. The pretrained backbone is loaded from Manojb/stable-diffusion-2-1-base on the Hugging Face Hub (the original stabilityai/stable-diffusion-2-1-base repository is no longer available). It is downloaded automatically on first run. To use a different backbone, override pretrained_model_name_or_path in the config or via --pretrained-model-name-or-path.


Data preparation

1. Download the datasets

| Dataset | Pol. | GSD | Link |
|---|---|---|---|
| QXS-SAROPT | single | 1 m | https://github.com/yaoxu008/QXS-SAROPT |
| SAR2Opt | single | 1 m | https://github.com/MarsZhaoYT/SAR2Opt-Heterogeneous-Dataset |
| SpaceNet6 | full | 0.5 m | https://spacenet.ai/sn6-challenge/ |
| Stellar-Vision | single | 0.5–1.2 m | proprietary (not publicly released) |

2. Generate the split lists

data/data_split.py writes train_eo_list*.txt/.pkl and test_eo_list*.txt/.pkl under ./data/<Dataset>_split/, which is exactly where the configs look for them:

python data/data_split.py --dataset spacenet --dataroot /path/to/SpaceNet6 --ratio 80
python data/data_split.py --dataset saropt   --dataroot /path/to/sar2opt
python data/data_split.py --dataset qxs      --dataroot /path/to/QXSLAB_SAROPT --ratio 80
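
As a rough illustration of what a percentage split does, here is a minimal sketch. It assumes `--ratio 80` means 80% of the EO images go to the training list, and that the list is shuffled with a fixed seed; neither detail is confirmed here, and `split_list` is a hypothetical helper, not a function from data/data_split.py.

```python
import random


def split_list(paths, ratio=80, seed=0):
    """Split EO image paths into (train, test) with `ratio` percent in train."""
    paths = sorted(paths)               # start from a deterministic order
    random.Random(seed).shuffle(paths)  # seeded shuffle for reproducibility
    n_train = len(paths) * ratio // 100
    return paths[:n_train], paths[n_train:]


# Hypothetical file names, for illustration only.
eo_paths = [f"img_{i:03d}.png" for i in range(10)]
train_eo, test_eo = split_list(eo_paths, ratio=80)
```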

The feeders derive the SAR path from the EO path automatically (e.g. PS-RGB → SAR-Intensity for SpaceNet6, trainB → trainA for SAR2Opt), so only the EO list is stored.
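
The path derivation amounts to a string substitution. The sketch below shows the idea with the two substitutions named above; the dictionary keys, the helper name `eo_to_sar_path`, and the example paths are hypothetical, and the actual rules live in feeders/feeder.py.

```python
# (EO tag, SAR tag) pairs taken from the examples above; keys are illustrative.
EO_TO_SAR = {
    "spacenet": ("PS-RGB", "SAR-Intensity"),
    "saropt": ("trainB", "trainA"),
}


def eo_to_sar_path(eo_path: str, dataset: str) -> str:
    """Derive the SAR image path by replacing the EO tag in the EO path."""
    eo_tag, sar_tag = EO_TO_SAR[dataset]
    if eo_tag not in eo_path:
        raise ValueError(f"{eo_tag!r} not found in {eo_path!r}")
    # str.replace substitutes every occurrence (directory and file name alike).
    return eo_path.replace(eo_tag, sar_tag)


sar_a = eo_to_sar_path("/data/SpaceNet6/PS-RGB/tile_PS-RGB_0001.tif", "spacenet")
sar_b = eo_to_sar_path("/data/sar2opt/trainB/0001.png", "saropt")
```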

3. Point the configs at your data

Edit the dataroot and accelerator_path fields in configs/*.yaml (they currently contain /path/to/... placeholders).


Pretrained models

accelerator_path in each config should point to a SAR-conditioned U-Net checkpoint (8-in / 4-out, i.e. C-DiffSET without the confidence channel). Training then adds the 5th variance channel on top of it.


Training

# SpaceNet6
python main.py --config configs/spacenet_eps_conf.yaml

# SAR2Opt
python main.py --config configs/saropt_eps_conf.yaml

# QXS-SAROPT
python main.py --config configs/qxs_eps_conf.yaml

Any argument in main.py can be overridden either through the YAML file or on the command line. Logs, checkpoints, and validation result grids are written to work_dir. Default settings follow the paper: AdamW, lr = 3e-5, weight decay 0.01, cosine schedule, 50,000 iterations, 100-step warmup.
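
To make the schedule concrete, the sketch below computes the learning rate at a given iteration using the common linear-warmup plus half-cosine-decay form. The exact shape used by the trainer (linear warmup, decay to zero) is an assumption, and `lr_at` is a hypothetical helper; only the numbers 3e-5, 100, and 50,000 come from the defaults above.

```python
import math


def lr_at(step, base_lr=3e-5, warmup_steps=100, total_steps=50_000):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at a few iterations.
samples = {step: lr_at(step) for step in (0, 50, 100, 25_050, 50_000)}
```

Under these assumptions the rate climbs to 3e-5 over the first 100 iterations, passes through half that value at the midpoint of the decay, and reaches zero at iteration 50,000.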


Inference

Translate a folder of SAR .png images with a fine-tuned checkpoint:

python test.py \
    --sar-dir     /path/to/test/SAR \
    --output-dir  ./results/eo \
    --conf-dir    ./results/confidence \
    --checkpoint  /path/to/model.safetensors \
    --num-inference-steps 50

--conf-dir is optional; when set, the confidence map captured at the midpoint denoising step (t = T/2) is min-max normalized and saved alongside each EO output. Inference uses a DDIM scheduler; 50 steps is the quality/speed sweet spot reported in the paper (~3.4 s per 512×512 image on an A6000).
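
Min-max normalization of a confidence map can be sketched as below. This is a NumPy illustration of the operation only: the helper name `minmax_to_uint8`, the 8-bit output, and the handling of constant maps are assumptions, not the behavior of test.py.

```python
import numpy as np


def minmax_to_uint8(conf: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Rescale a map to [0, 1] by its own min/max, then quantize to 0..255."""
    conf = conf.astype(np.float32)
    lo, hi = float(conf.min()), float(conf.max())
    scaled = (conf - lo) / max(hi - lo, eps)  # eps guards against a constant map
    return np.round(scaled * 255.0).astype(np.uint8)


# Toy 2x2 map for illustration.
conf_map = np.array([[-2.0, 0.0], [1.0, 6.0]])
conf_img = minmax_to_uint8(conf_map)
```

Because each map is normalized by its own range, the saved images show relative confidence within one output and are not directly comparable across images.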


Evaluation metrics

utils.py implements the reported metrics via torchmetrics, lpips, and pytorch-fid: FID, LPIPS, SCC, SSIM, PSNR (and Inception Score for large batches). Validation during training reports PSNR and keeps the best checkpoint.
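
For reference, PSNR follows directly from the mean squared error. The sketch below is the textbook definition in NumPy, assuming 8-bit images with a data range of 255; utils.py relies on torchmetrics instead, so treat this as an explanation of the metric rather than the repository's implementation.

```python
import numpy as np


def psnr(pred: np.ndarray, target: np.ndarray, data_range: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE)."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * float(np.log10(data_range ** 2 / mse))


# Toy example: every pixel is off by exactly one gray level, so MSE = 1.
target = np.zeros((4, 4), dtype=np.uint8)
pred = np.ones((4, 4), dtype=np.uint8)
value = psnr(pred, target)
```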


Results

SAR2Opt & SpaceNet6 (Table I)

Each metric cell lists SAR2Opt / SpaceNet6.

| Type | Method | FID↓ | LPIPS↓ | SSIM↑ | PSNR↑ |
|---|---|---|---|---|---|
| GAN | Pix2Pix | 196.87 / 124.55 | 0.426 / 0.256 | 0.216 / 0.522 | 15.42 / 19.36 |
| GAN | CycleGAN | 139.72 / 114.81 | 0.425 / 0.274 | 0.224 / 0.493 | 14.93 / 17.80 |
| GAN | StegoGAN | 144.54 / 75.12 | 0.398 / 0.244 | 0.237 / 0.516 | 15.62 / 18.96 |
| GAN | MT-GAN | 135.42 / 72.35 | 0.385 / 0.238 | 0.252 / 0.528 | 15.82 / 19.15 |
| DDPM | SF-Diff | 91.05 / 66.12 | 0.416 / 0.258 | 0.272 / 0.249 | 16.35 / 18.85 |
| LDM | BBDM | 94.72 / 81.86 | 0.473 / 0.302 | 0.234 / 0.217 | 15.13 / 17.68 |
| LDM | cBBDM | 97.64 / 72.77 | 0.394 / 0.243 | 0.285 / 0.254 | 16.59 / 19.03 |
| LDM | C-DiffSET (Ours) | 77.81 / 37.44 | 0.346 / 0.142 | 0.286 / 0.567 | 16.61 / 21.02 |

QXS-SAROPT (Table II)

| Method | FID↓ | LPIPS↓ | SCC↑ | SSIM↑ | PSNR↑ |
|---|---|---|---|---|---|
| MT-GAN | 73.15 | 0.378 | 0.0021 | 0.295 | 15.98 |
| SF-Diff | 64.18 | 0.432 | 0.0021 | 0.299 | 16.11 |
| cBBDM | 69.47 | 0.420 | 0.0023 | 0.304 | 16.25 |
| C-DiffSET (Ours) | 18.15 | 0.293 | 0.0108 | 0.372 | 18.08 |

See the paper and the project page for the full comparison, the Stellar-Vision geo-disjoint evaluation, and all ablations (β, channel-mapping, text prompt, backbone).


Repository structure

C-DiffSET/
├── main.py                 # entry point (YAML + argparse)
├── train.py                # Trainer: model setup, C-Diff loss, train/val loops
├── test.py                 # standalone inference
├── utils.py                # metrics + logging/reporting
├── feeders/
│   ├── __init__.py
│   └── feeder.py           # dataset classes (QXS, SAR2Opt, SpaceNet6, SEN1-2, Stellar)
├── data/
│   └── data_split.py       # train/test split generator
├── configs/
│   ├── qxs_eps_conf.yaml
│   ├── saropt_eps_conf.yaml
│   └── spacenet_eps_conf.yaml
├── assets/                 # figures for the README
├── requirements.txt
├── LICENSE
└── README.md

Citation

If you find this work useful, please cite:

@article{do2026cdiffset,
  title   = {C-DiffSET: Leveraging Latent Diffusion for SAR-to-EO Image Translation with Confidence-Guided Reliable Object Generation},
  author  = {Do, Jeonghyeok and Lee, Jaehyup and Lee, Seungchul and Kim, Munchurl},
  journal = {IEEE Transactions on Circuits and Systems for Video Technology},
  year    = {2026},
  doi     = {10.1109/TCSVT.2026.3701447}
}

Acknowledgements

This work builds on Stable Diffusion and the Hugging Face Diffusers library. The C-Diff loss is inspired by the β-NLL formulation of Seitzer et al. We thank the authors of the QXS-SAROPT, SAR2Opt, and SpaceNet6 datasets.

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean Government (MSIT) (RS-2024-00338513).

License

Released under the MIT License. Note that the Stellar-Vision dataset is proprietary and is not included in this release.
