June 22, 2026
Can you run a 35-billion-parameter model on a GPU with only 6 GiB of VRAM at usable speed? Yes. This guide walks through exactly how: using llama.cpp, Qwen 3.6 35B A3B, and five specific flags to hit ~17 tokens/sec on an 8-year-old GTX 1060.
Hardware (worst-case baseline):
Software:
This rig is a floor, not a ceiling. If your hardware is newer, your numbers will be better.
Most beginners split by layers: some layers on GPU, rest on CPU. That fails for MoE because every layer carries all its experts with it. With mixture-of-experts, only a handful of experts wake per token, so the smart split is different.
Flag:
--n-cpu-moe 41
This pins all expert weights to CPU while sending everything else to the GPU. Result: speed jumps from ~3 tok/s to ~10 tok/s β a 230% improvement with no hardware change.
By default, llama.cpp mmap's the entire model file. The OS pages chunks in on demand. That sounds smart, but during inference the model frequently requests an expert that hasn't been loaded yet, causing disk reads and mid-token page faults.
Fix: load the full 20 GB model into RAM upfront with --no-mmap. No more disk reads, no more page faults. Speed goes from ~10 tok/s to ~13.5 tok/s β roughly another 35% bump from one flag.
At ~13.5 tok/s, the GPU still had ~2 GiB free. Move more layers onto the GPU:
--n-cpu-moe 35
Pulling six layers' worth of experts back to the GPU pushes VRAM from 4 GiB to 5.5 GiB and speeds inference to ~17 tok/s. The trade-off: context window drops from 100,000 to ~64,000 tokens. That's fine for chats, but tight for a whole codebase.
Context is expensive because the KV cache grows linearly with sequence length. Even with Q8 quantization (effectively lossless), doubling context doubles memory.
Google DeepMind's Turbo Quant paper showed that 4-bit keys and 3-bit values can be nearly lossless, especially for grouped-query-attention models like this one. Use two flags:
--cache-type-k q4_0 --cache-type-v q3_0
Bump context from 64,000 to 128,000 β it loads, VRAM at 5.3 GiB. Then try 256,000 by moving one more expert layer to CPU (--n-cpu-moe 36). It fits: 5.9 of 6 GiB used, same 17 tok/s speed, because the compressed cache lookup is essentially free.
Practical benefit: you can paste a small book or an entire codebase into context without the model forgetting page one by page 50.
Without --mlock, the kernel treats all those expert weights in RAM as regular files. Over hours or days, under memory pressure or during idle, the OS pages some out to disk. Next inference hits a page fault: stutter, random slow tokens, gradual degradation.
Fix requires three things:
mlock permission--cap-add IPC_LOCK--mlockAfter all three are in place:
--mlock
Check /proc/<pid>/status for mlocked β in this run it locked ~16 GB. Same 17 tok/s speed, but now the setup survives days instead of degrading overnight.
Combining all five flags inside Docker:
# Example launch (adjust paths/ports to your environment)
docker run --gpus all --cap-add IPC_LOCK \
-v /path/to/models:/models \
ghcr.io/ggerganov/llama.cpp:latest \
./llama-server \
-m /models/Qwen3.6-35B-A3B-Q4_K_M.gguf \
--n-cpu-moe 36 \
--no-mmap \
--cache-type-k q4_0 \
--cache-type-v q3_0 \
--mlock \
-c 256000 \
--port 8080
Result: 35 billion parameters, 6 GiB VRAM, 256,000-token context, ~17 tokens/sec, stable over days.
Speculative decoding runs a small drafter model that guesses the next several tokens; the large model verifies the batch. On dense transformers, this is 2-4x faster. Here, it made things slower.
MoE kills batching. Each token in the batch picks its own 8 experts out of 256. Eight tokens batched together can touch 64 different experts per layer. The verification step stops being a nice batch and turns into memory thrashing across PCIe.
SSM layers break parallelism. 30 of 40 layers are state-space-model layers, which compute one position at a time based on the previous state. You cannot parallelize the draft-window verification cleanly.
Net result: the draft acceptance rate was decent (~65%), but overall speed dropped from 17 tok/s to 11 tok/s. Someone independently benchmarked this on a 3090 across 19 configurations and found the same result: speculative decoding works for transformers, not for this MoE+SSM combo.
--n-cpu-moe, --no-mmap, reducing --n-cpu-moe a bit for speed, Turbo Quant KV cache, and --mlock for long-running stability.