[Bug] CUDA Illegal Memory Access (CONCAT failed in ggml_cuda_compute_forward) When Using LoRAs With CPU Parameter Offloading #1558

@bitbite0

Description

Git commit

5b0267e

Operating System & Version

Windows 11

GGML backends

CUDA

Command-line arguments used

sd-cli --diffusion-model "Z:\gguf_models\Image\Qwen-Image\qwen-image-2512-Q4_K_M.gguf" --vae "Z:\gguf_models\Image\Qwen-Image\qwen_image_vae.safetensors" --llm "Z:\gguf_models\Image\Qwen-Image\Qwen2.5-VL-7B-Instruct-UD-Q4_K_XL.gguf" -v --diffusion-fa -W 720 -H 1024 --seed 42 --steps 8 --cfg-scale 1 --sampling-method euler --backend all=cuda0 --params-backend diffusion=cpu,te=cpu,vae=cpu --mmap --lora-model-dir "Z:\gguf_models\Image\Qwen-Image\LoRA" -p "a pack of pikachus in a lush forest<lora:Qwen-Image-2512-Lightning-8steps-V1.0-bf16:1>"

Steps to reproduce

  1. Load a quantized Qwen‑Image GGUF model (Q4_K_M, etc.).
  2. Load a Lightning LoRA via --lora-model-dir.
  3. Enable CPU parameter offloading:
    • --offload-to-cpu
      or
    • --params-backend diffusion=cpu,te=cpu,vae=cpu
  4. Start a generation request.
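
To narrow the trigger, the steps above can be turned into a small matrix of runs that toggles the LoRA and the CPU parameter placement independently. The sketch below only builds the argument lists from the flags in this report. The helper name build_args and the shortened file paths are illustrative placeholders, the <lora:name:weight> prompt tag is assumed to be the form used in the original command, and nothing is executed unless the lists are passed to something like subprocess.run.

```python
from itertools import product

# Placeholder paths; substitute the real model locations from the report.
BASE = [
    "sd-cli",
    "--diffusion-model", "qwen-image-2512-Q4_K_M.gguf",
    "--vae", "qwen_image_vae.safetensors",
    "--llm", "Qwen2.5-VL-7B-Instruct-UD-Q4_K_XL.gguf",
    "-v", "--diffusion-fa", "-W", "720", "-H", "1024",
    "--seed", "42", "--steps", "8", "--cfg-scale", "1",
    "--sampling-method", "euler",
    "--backend", "all=cuda0",
    "--mmap",
]
PROMPT = "a pack of pikachus in a lush forest"
# Assumed prompt tag syntax for selecting the LoRA.
LORA_TAG = "<lora:Qwen-Image-2512-Lightning-8steps-V1.0-bf16:1>"


def build_args(use_lora: bool, cpu_params: bool) -> list[str]:
    """Return the sd-cli argument list for one cell of the matrix."""
    args = list(BASE)
    if cpu_params:
        args += ["--params-backend", "diffusion=cpu,te=cpu,vae=cpu"]
    if use_lora:
        args += ["--lora-model-dir", "LoRA"]  # placeholder directory
    args += ["-p", PROMPT + (LORA_TAG if use_lora else "")]
    return args


# One entry per combination of (use_lora, cpu_params).
matrix = {
    (use_lora, cpu_params): build_args(use_lora, cpu_params)
    for use_lora, cpu_params in product([False, True], repeat=2)
}

for (use_lora, cpu_params), args in matrix.items():
    print(f"lora={use_lora} cpu_params={cpu_params}: {' '.join(args)}")
```

Going by this report, the crash should appear only in the cell where both the LoRA and CPU parameter placement are enabled.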

What you expected to happen

The image is generated without errors.

What actually happened

When applying a Lightning LoRA to a quantized Qwen‑Image GGUF base model, the system crashes at the first sampling step only when CPU parameter offloading is enabled (--offload-to-cpu or --params-backend diffusion=cpu).

The crash consistently reports:

ggml_cuda_compute_forward: CONCAT failed
CUDA error: an illegal memory access was encountered

This occurs in both sd-cli and sd-server.

The backend appears to expect all tensors in the graph to be on one device but receives a mix of CPU and GPU tensors instead. My guess is that the LoRA is applied on the GPU while the diffusion model's parameters are kept on the CPU, so the model's data ends up split across two different devices.
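
The suspected failure mode can be illustrated with a toy model. This is not ggml or stable-diffusion.cpp code: ToyTensor, check_same_device, and the tensor names are hypothetical, and which tensor actually ends up on which device is not known from the logs. It only sketches the kind of pre-dispatch check that would turn an illegal memory access into a readable error.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyTensor:
    name: str
    device: str  # e.g. "CPU" or "CUDA0"


def check_same_device(op: str, inputs: list[ToyTensor], expected: str) -> None:
    """Raise a readable error if any input lives on an unexpected device."""
    stray = [t for t in inputs if t.device != expected]
    if stray:
        names = ", ".join(f"{t.name} ({t.device})" for t in stray)
        raise RuntimeError(f"{op}: expected all inputs on {expected}, got {names}")


# Hypothetical inputs to a concat: one tensor was moved to the GPU for
# compute, the other was left behind in host memory.
inputs = [ToyTensor("tensor_a", "CUDA0"), ToyTensor("tensor_b", "CPU")]

try:
    check_same_device("CONCAT", inputs, expected="CUDA0")
    message = ""
except RuntimeError as err:
    message = str(err)

print(message)
```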

If I force --lora-apply-mode immediately, the mismatch is caught earlier and a type-combination error is reported instead of a crash:

unsupported type combination (f32 to q6_K)

Logs / error messages / stack trace

CLI (sd-cli)

[DEBUG] ggml_extend.hpp:1907 - qwen2.5vl compute buffer size: 9.10 MB(VRAM)
[INFO ] ggml_extend.hpp:2147 - qwen2.5vl offload params (5918.09 MB, 338 tensors) to runtime backend (CUDA0), taking 3.25s
[DEBUG] conditioner.hpp:2030 - computing condition graph completed, taking 3463 ms
[INFO ] stable-diffusion.cpp:3788 - get_learned_condition completed, taking 3.46s
[INFO ] stable-diffusion.cpp:4021 - generating image: 1/1 - seed 42
[DEBUG] ggml_extend.hpp:1907 - qwen_image compute buffer size: 444.05 MB(VRAM)
[INFO ] ggml_extend.hpp:2147 - qwen_image offload params (12631.04 MB, 1933 tensors) to runtime backend (CUDA0), taking 6.17s
[ERROR] ggml_extend.hpp:69   - ggml_cuda_compute_forward: CONCAT failed
[ERROR] ggml_extend.hpp:69   - CUDA error: an illegal memory access was encountered
[ERROR] ggml_extend.hpp:69   -   current device: 0, in function ggml_cuda_compute_forward at C:\gitrepo\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:3114
[ERROR] ggml_extend.hpp:69   -   err
C:\gitrepo\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:102: CUDA error

Server (sd-server)

[DEBUG] ggml_extend.hpp:1907 - qwen2.5vl compute buffer size: 9.10 MB(VRAM)
[INFO ] ggml_extend.hpp:2147 - qwen2.5vl offload params (5918.09 MB, 338 tensors) to runtime backend (CUDA0), taking 1.55s
[DEBUG] conditioner.hpp:2030 - computing condition graph completed, taking 1767 ms
[INFO ] stable-diffusion.cpp:3788 - get_learned_condition completed, taking 1.77s
[INFO ] stable-diffusion.cpp:4021 - generating image: 1/1 - seed 42
[DEBUG] ggml_extend.hpp:1907 - qwen_image compute buffer size: 444.05 MB(VRAM)
[INFO ] ggml_extend.hpp:2147 - qwen_image offload params (12631.04 MB, 1933 tensors) to runtime backend (CUDA0), taking 4.48s
[ERROR] ggml_extend.hpp:69   - ggml_cuda_compute_forward: CONCAT failed
[ERROR] ggml_extend.hpp:69   - CUDA error: an illegal memory access was encountered
[ERROR] ggml_extend.hpp:69   -   current device: 0, in function ggml_cuda_compute_forward at C:\gitrepo\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:3114
[ERROR] ggml_extend.hpp:69   -   err
C:\gitrepo\stable-diffusion.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:102: CUDA error

Additional context / environment details

System Configuration

  • GPU: NVIDIA GeForce RTX 4070 Ti SUPER (16 GB VRAM)
  • Build: master‑633‑5b0267e (commit 5b0267e)
  • Core Models: Qwen-Image-2512 (Q4_K_M GGUF base + Qwen2.5-VL-7B Text Encoder)
  • LoRA Model: LightX2V-Qwen-Image-Lightning

Labels

bug