Running 35B AI on a 6GB GPU: 5 Flags That Actually Matter

Can you run a 35-billion-parameter model on a GPU with only 6 GiB of VRAM at usable speed? Yes. This guide walks through exactly how: using llama.cpp, Qwen 3.6 35B A3B, and five specific flags to hit ~17 tokens/sec on an 8-year-old GTX 1060.

The Setup

Hardware (worst-case baseline):

GPU: GTX 1060 (6 GB VRAM, PCIe Gen3)
CPU: Intel Core i3-8100 (4 cores, no hyper-threading)
RAM: 24 GB DDR4

Software:

Engine: llama.cpp
Model: Qwen 3.6 35B A3B (MoE, 35B total / 3B active per token)

This rig is a floor, not a ceiling. If your hardware is newer, your numbers will be better.

Trick 1: Split Experts, Not Layers

Most beginners split by layers: some layers on GPU, rest on CPU. That fails for MoE because every layer carries all its experts with it. With mixture-of-experts, only a handful of experts wake per token, so the smart split is different.

Keep the small, fast-firing shared parts on the GPU.
Push the huge sleeping expert blocks into RAM/CPU.

Flag:

--n-cpu-moe 41

This pins all expert weights to CPU while sending everything else to the GPU. Result: speed jumps from ~3 tok/s to ~10 tok/s — a 230% improvement with no hardware change.

Trick 2: Disable mmap for Predictable RAM Access

By default, llama.cpp mmap's the entire model file. The OS pages chunks in on demand. That sounds smart, but during inference the model frequently requests an expert that hasn't been loaded yet, causing disk reads and mid-token page faults.

Fix: load the full 20 GB model into RAM upfront with --no-mmap. No more disk reads, no more page faults. Speed goes from ~10 tok/s to ~13.5 tok/s — roughly another 35% bump from one flag.

Trick 3: Reclaim VRAM for Context

At ~13.5 tok/s, the GPU still had ~2 GiB free. Move more layers onto the GPU:

--n-cpu-moe 35

Pulling six layers' worth of experts back to the GPU pushes VRAM from 4 GiB to 5.5 GiB and speeds inference to ~17 tok/s. The trade-off: context window drops from 100,000 to ~64,000 tokens. That's fine for chats, but tight for a whole codebase.

Trick 4: Turbo Quant the KV Cache

Context is expensive because the KV cache grows linearly with sequence length. Even with Q8 quantization (effectively lossless), doubling context doubles memory.

Google DeepMind's Turbo Quant paper showed that 4-bit keys and 3-bit values can be nearly lossless, especially for grouped-query-attention models like this one. Use two flags:

--cache-type-k q4_0 --cache-type-v q3_0

Bump context from 64,000 to 128,000 — it loads, VRAM at 5.3 GiB. Then try 256,000 by moving one more expert layer to CPU (--n-cpu-moe 36). It fits: 5.9 of 6 GiB used, same 17 tok/s speed, because the compressed cache lookup is essentially free.

Practical benefit: you can paste a small book or an entire codebase into context without the model forgetting page one by page 50.

Trick 5: Lock Memory with mlock

Without --mlock, the kernel treats all those expert weights in RAM as regular files. Over hours or days, under memory pressure or during idle, the OS pages some out to disk. Next inference hits a page fault: stutter, random slow tokens, gradual degradation.

Fix requires three things:

LXC container has mlock permission
Docker has --cap-add IPC_LOCK
llama.cpp uses --mlock

After all three are in place:

--mlock

Check /proc/<pid>/status for mlocked — in this run it locked ~16 GB. Same 17 tok/s speed, but now the setup survives days instead of degrading overnight.

Final Command

Combining all five flags inside Docker:

# Example launch (adjust paths/ports to your environment)
docker run --gpus all --cap-add IPC_LOCK \
  -v /path/to/models:/models \
  ghcr.io/ggerganov/llama.cpp:latest \
  ./llama-server \
    -m /models/Qwen3.6-35B-A3B-Q4_K_M.gguf \
    --n-cpu-moe 36 \
    --no-mmap \
    --cache-type-k q4_0 \
    --cache-type-v q3_0 \
    --mlock \
    -c 256000 \
    --port 8080

Result: 35 billion parameters, 6 GiB VRAM, 256,000-token context, ~17 tokens/sec, stable over days.

What Didn't Work: Speculative Decoding

Speculative decoding runs a small drafter model that guesses the next several tokens; the large model verifies the batch. On dense transformers, this is 2-4x faster. Here, it made things slower.

Why

MoE kills batching. Each token in the batch picks its own 8 experts out of 256. Eight tokens batched together can touch 64 different experts per layer. The verification step stops being a nice batch and turns into memory thrashing across PCIe.
SSM layers break parallelism. 30 of 40 layers are state-space-model layers, which compute one position at a time based on the previous state. You cannot parallelize the draft-window verification cleanly.

Net result: the draft acceptance rate was decent (~65%), but overall speed dropped from 17 tok/s to 11 tok/s. Someone independently benchmarked this on a 3090 across 19 configurations and found the same result: speculative decoding works for transformers, not for this MoE+SSM combo.

Key Takeaways

MoE changes the splitting logic. With mixture-of-experts, put the giant sleeping expert blocks in RAM and keep the thin active layers on the GPU.
Five flags fixed the baseline: --n-cpu-moe, --no-mmap, reducing --n-cpu-moe a bit for speed, Turbo Quant KV cache, and --mlock for long-running stability.
This is the floor, not the ceiling. The hardware here is eight years old. Newer GPUs, PCIe Gen4, and faster RAM will push these numbers higher.